The University of Alberta 
Printing Department 
Edmonton, Alberta 




Digitized by the Internet Archive 
In 2024 with funding from 
University of Alberta Library 


https://archive.org/details/Bickis 1973 


THE UNIVERSITY OF ALBERTA


RELEASE FORM 


NAME OF AUTHOR          Mikelis G. Bickis

TITLE OF THESIS         Information and Markov processes

DEGREE FOR WHICH THESIS WAS PRESENTED    Master of Science

YEAR THIS DEGREE GRANTED    1973

Permission is hereby granted to THE UNIVERSITY OF
ALBERTA LIBRARY to reproduce single copies of this thesis
and to lend or sell such copies for private, scholarly or
scientific research purposes only.

The author reserves other publication rights, and
neither the thesis nor extensive extracts from it may be
printed or otherwise reproduced without the author's
written permission.




THE UNIVERSITY OF ALBERTA 


INFORMATION AND MARKOV PROCESSES 
BY 


MIKELIS G. BICKIS


A THESIS 
SUBMITTED TO THE FACULTY OF GRADUATE STUDIES & RESEARCH 
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE 
OF MASTER OF SCIENCE 
IN 


STATISTICS 


DEPARTMENT OF MATHEMATICS 


EDMONTON, ALBERTA 


FALL, 1973 




THE UNIVERSITY OF ALBERTA 


FACULTY OF GRADUATE STUDIES AND RESEARCH 


The undersigned certify that they have read, and 
recommend to the Faculty of Graduate Studies and Research, 
for acceptance, a thesis entitled


INFORMATION AND MARKOV PROCESSES


submitted by Mikelis G. Bickis
in partial fulfilment of the requirements for the degree of
Master of Science.




Dedicated to my parents. 




ABSTRACT 


Several concepts of information can be defined on algebras of
events, and can be related to probabilities. Two very useful concepts
are entropy and discrimination information, with applications to
communication theory and statistical inference, respectively. Conditional
information can also be defined, given an event, or sub-algebra of events.

A historical summary of information theory is given in Chapter
II, which includes several generalizations of the information concept.

The properties of conditional information are employed in
reaching a general result concerning the information in a Markov process.




ACKNOWLEDGEMENT


I want to thank Dr. S. Ghurye, my supervisor, for his understanding
and his patience. I want to thank my friends and colleagues, for
fruitful discussions, and their encouragement. And I want to thank my wife,
Jane, without whose moral support the task would have been far more
difficult.

I would like to acknowledge the financial support of the
University of Alberta and the National Research Council of Canada.




TABLE OF CONTENTS


INTRODUCTION

CHAPTER I:    THE CONCEPT OF INFORMATION

    1. Probability and Information
    2. Sub-algebras and Conditioning
    3. Entropy and Discrimination Information
    4. Properties of Entropy and Discrimination Information

CHAPTER II:   A BRIEF HISTORY

    1. Information and Communication
    2. Information and Inference
    3. Various Tangents

CHAPTER III:  INFORMATION IN MARKOV PROCESSES

    1. Basics of Markov Theory
    2. Two Simple Examples
    3. A General Formulation

REFERENCES
APPENDIX I
APPENDIX II




INTRODUCTION 


In this dissertation, we will be examining concepts of information
and probability, and some applications. Working from an intuitive
basis, we will consider how the two concepts are related, and will look at
how concepts of information can be utilized in communication theory,
statistical inference, and Markov processes.

In chapter I, we introduce measures of probability and information
on an algebra of events. Some properties of functions defined on a
family of probability algebras are examined. Two such functions are
entropy and discrimination information, and their characteristics are listed.
The content of this chapter is a reformulation and unification of diverse
results.

In chapter II various ideas concerning information are collected
together from several sources. Applications in communication theory and
statistical inference are considered.

In the final chapter, we are concerned with the information for
discriminating between two Markov processes. After some preliminaries, a
general result is proved for processes of the jump kind. We define a
concept of infinitesimal information, and relate it to the discrimination
information contained in an interval of time via an operator equation. The
content (except for some expository preliminaries) is original.

Two theorems, which could not be found in the literature in the
desired form, but which digress from the main path of the dissertation, are
included as appendices. The proofs are our own.




CHAPTER I 


THE CONCEPT OF INFORMATION 


1. Probability and Information.


Information is concerned with knowledge, and knowledge is derived
from observations. In real life, all our observations are limited in
precision, and our actions are finite in number. Mathematics has invented real
numbers, which greatly simplify analysis, but no empirical scientist has
ever observed "a real number". Rather he observes something in an interval
whose size is determined by the precision of his instruments. But we can
consider the real numbers as being "limits of precision", in a sense defined
so as to "complete" the number system, either via Dedekind cuts, or Cauchy
sequences.

Similarly, few Borel sets can be "measured". Rather we can only
measure unions of a finite number of intervals. The "events" of classical
probability theory are again limits (in some sense) of empirical events,
invented for both aesthetic and utilitarian reasons. In the limiting
process we give birth to the orphans called "sets of measure zero", which
are both possible and impossible. But these sets are not in fact
observable, so it is a probabilistic enigma to call them events.
Rather they are reminders of the fact that we put no upper limit on our
possible precision. We can accomplish the same by talking about a measure
algebra without atoms.

We will adopt the philosophy that any concept is the limit, in
some sense, of that concept defined on finite systems. As much as possible




we will consider an algebra of events, rather than a "point set", as
primitive. This "algebraic" approach to measure theory can be traced back to
Caratheodory [4]. It has been applied to probability theory by Halmos [19],
Birkhoff [2] and Kappos [25,26]. A fuller bibliography can be found in
Kappos [26]. Information is defined on events and algebras of events, so
that the "points" would for the most part be unnecessary baggage. However,
there are times (particularly in the final chapter) where we would like to
make use of standard results in analysis, and then we will represent our
measure algebra in $(\Omega, \mathcal{A}, \mu)$ style.


1.1 There are several interpretations of the concept "probability" [3],
but for the moment we will adopt the subjectivist one that probability is
a measure of belief or expectation. Mathematically, we can consider a
class of events, or observations; the "more likely" observations having a
higher probability than the "less likely" ones. For any two events, we
can conceive of their logical conjunction and logical disjunction; and if
we adopt the convention of a sure event I and an impossible event 0,
then our class has the structure of a Boolean algebra.


Let us call our class of events E. Our probability can be
expressed as a function $p : E \to [0,1]$ such that $p(0) = 0$, $p(I) = 1$,
$p(A) \in (0,1)$ for every other $A$, and
$A \wedge B = 0 \Rightarrow p(A \vee B) = p(A) + p(B)$, and hence the
couple $(E,p)$ defines a measure algebra. Denoting by "+" the symmetric
difference on E, we find that $\rho(A,B) = p(A+B)$ is a metric, and to make
things neater we usually talk about $\bar{E}$, the completion of E. $p$ can be
extended to $\bar{p}$ on $\bar{E}$ by continuity, and it follows that
$\bar{p}$ is continuous in both the metric and the stronger, order
topology [26, chapter 2].
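
As a small illustration of the metric just described, the following sketch builds a finite probability algebra and checks the metric properties of the symmetric-difference distance by brute force. The three atoms and their weights are hypothetical, chosen only for the example; events are modelled as sets of atoms, which is an assumption of the sketch rather than the point-free setting adopted in the text.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite example: three atoms with strictly positive weights.
WEIGHTS = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}


def events(atoms):
    """All events of the finite algebra, each a frozenset of atoms."""
    atoms = list(atoms)
    return [frozenset(c) for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)]


def p(event):
    """Probability of an event: the sum of the weights of its atoms."""
    return sum((WEIGHTS[x] for x in event), Fraction(0))


def rho(a, b):
    """Distance rho(A, B) = p(A + B), with + the symmetric difference."""
    return p(a ^ b)


ALL = events(WEIGHTS)
```

Because every atom has positive weight, `rho(a, b)` vanishes only when the two events coincide, which mirrors the strict positivity required of the probability.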




Note that we have required the probability to be strictly positive,
i.e., there is only one event of measure zero, the impossible event.
(Such a convention is analogous to the situation in set theory, where
there is only one "empty" set.) In relating this approach to the classical
theory, it must be borne in mind that our events are modulo the $\sigma$-ideal
of null sets, i.e., we identify sets which differ only by a set of measure
zero.

A "probability" which is not strictly positive we will call
improper. A "probability" for which $p(I) < 1$ we will call defective.

The completed algebra $\bar{E}$ is a $\sigma$-algebra, and the probability
$\bar{p}$ is $\sigma$-additive. All probability algebras in this dissertation
will be assumed to be complete in the above sense, and hence $\sigma$-algebras.

If the elements of E are thought of as propositions, then I
represents a tautology, 0 a contradiction, and the other elements are
contingent propositions. Elements which are "smaller" (in the lattice sense)
represent more specific propositions, and people are often interested in
making their statements more specific.


1.2 Probability is, of course, a monotone function on E, and it is
reasonable to expect any function on E measuring information to be
monotone decreasing, as one would think that more specific statements
convey more information. We can draw a closer link between the concepts of
probability and information if we make a vague appeal to psychology, and
maintain that the occurrence of an unexpected event (or equivalently, the
confirmation of a dubious proposition) appears to convey more "information"




than the occurrence of an event we were expecting. 


One can thus postulate that a measure of information should be a 
decreasing function of the probability. It is reasonable to expect that 
the information provided by independent events should be additive, hence a 


reasonable measure of the information of an event would be 


$$J(A) = -\log p(A). \tag{1.2.1}$$
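
The two requirements stated above (decreasing in the probability, additive over independent events) can be checked numerically for (1.2.1). The sketch below is only an illustration: the function name `information` and the choice of natural logarithms are assumptions, not notation from the text.

```python
import math


def information(prob):
    """J(A) = -log p(A), in nats; the impossible event carries infinite information."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return math.inf if prob == 0.0 else -math.log(prob)


# Independent events multiply in probability, so their information adds.
joint = information(0.5 * 0.25)
separate = information(0.5) + information(0.25)
```

Here `joint` and `separate` agree up to rounding, and a less probable event always receives the larger value.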


1.3 One can deduce (1.2.1) axiomatically [24]: Let J be a non-negative,
decreasing function on E such that $J(0) = \infty$, $J(I) = 0$. Let
us also postulate the existence of a finite or countable family of
subalgebras $F_\alpha$ which is such that for any family of events
$\{A_\alpha\}$ such that $A_\alpha \in F_\alpha$, $A_\alpha \neq 0$,

$$\bigwedge_\alpha A_\alpha \neq 0. \tag{1.3.1}$$

Such a family is called M-independent. Intuitively, this means that
elements of different $F_\alpha$'s are mutually compatible, so that presumably
information about the one gives no information about the others.


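Suppose now that J satisfies the following axioms:

For $A_\alpha \in F_\alpha$,

$$J\Big(\bigwedge_\alpha A_\alpha\Big) = \sum_\alpha J(A_\alpha). \tag{1.3.2}$$

There is an operation T on the positive extended real axis,
such that for any $A, B \in E$ such that $A \wedge B = 0$,

$$J(A \vee B) = J(A) \; T \; J(B). \tag{1.3.3}$$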




Tf age rest A ee oT then 
O o 


J(A,) + [J(A,AA,) T J(A, AAS) Z J(ALTA,) 


STNG J(AAAL) 1 1 [J(A,)+I(A,AA,) ] rT [J(A,)+I (AAA, )] , tare 


Axiom (1.3.4) is somewhat technical and represents a limited distributive 


law. 


Not all operations T make the axioms consistent. If we also
require that we can assign information values within any sub-algebra $F_\alpha$
with no regard to the values in the others, then T can only take on two
forms:

$$x \; T \; y = \inf(x,y) \tag{1.3.5}$$

$$x \; T \; y = -c \log\left(e^{-x/c} + e^{-y/c}\right) \tag{1.3.6}$$

If T is of the form (1.3.5), then there exists no probability
measure p on E such that J can be represented as a strictly decreasing
left-continuous function of p. If T is of the form (1.3.6), then
there exists a probability measure p on E such that

$$J = -c \log p. \tag{1.3.7}$$
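
To see why the form (1.3.6) goes together with (1.3.7), one can check numerically that composing the informations of two disjoint events reproduces the information of their join when probabilities add. The sketch below is a hedged illustration; the names `compose` and `info` and the sample numbers are assumptions made for the example.

```python
import math


def compose(x, y, c=1.0):
    """x T y = -c log(exp(-x/c) + exp(-y/c)), the form (1.3.6)."""
    return -c * math.log(math.exp(-x / c) + math.exp(-y / c))


def info(prob, c=1.0):
    """J = -c log p, the representation (1.3.7)."""
    return -c * math.log(prob)


# Disjoint events with p(A) = 0.2 and p(B) = 0.3, so that p(A v B) = 0.5.
lhs = compose(info(0.2), info(0.3))
rhs = info(0.5)
```

Both candidate operations also distribute over addition, `x + (y T z) = (x + y) T (x + z)`, which can be verified directly for `compose` and for `min`.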


1.4 Equation (1.2.1) defines the information gained by the confirmation
of an uncertain event. We can generalize this formula if we say that after
our experiment, the probability of A is now q(A). (This generalization
makes sense in either the subjectivist or logical interpretations of
probability, but it has no frequentist interpretation.) Then we can say that
the information transmitted by this change of probability is

$$J(A) = \log q(A) - \log p(A). \tag{1.4.1}$$


This quantity is properly described as information in favour of A, for
if $A^c$ is confirmed, and hence $q(A) = 0$, we have $J(A) = -\infty$; but we
have not lost information about A, we have merely lost information in
favour of A.


In (1.4.1) q(A) is different from unity only if we do not 
directly observe A . We observe some other event B , and we know the 
stochastic link between A and B , and hence we can calculate q(A) ; 


i.e., if we observe B , then q(A) is the conditional probability 


$$p(A \mid B) = p(A \wedge B)/p(B).$$
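
A minimal sketch of (1.4.1) with q(A) taken as the conditional probability given an observed event B. The fair die and the function names are hypothetical, introduced only for the example, and events are again modelled as sets of outcomes.

```python
import math
from fractions import Fraction

# Hypothetical example: one roll of a fair die.
P = {face: Fraction(1, 6) for face in range(1, 7)}


def prob(event):
    """Probability of an event given as a set of faces."""
    return sum((P[x] for x in event), Fraction(0))


def cond_prob(a, b):
    """p(A|B) = p(A ^ B) / p(B), defined for p(B) > 0."""
    return prob(a & b) / prob(b)


def info_in_favour(a, b):
    """J(A) = log q(A) - log p(A) with q(A) = p(A|B), as in (1.4.1)."""
    q = cond_prob(a, b)
    if q == 0:
        return -math.inf
    return math.log(float(q)) - math.log(float(prob(a)))


six, even, odd = {6}, {2, 4, 6}, {1, 3, 5}
```

Observing `even` doubles the probability of `six`, so the information in favour of `six` is log 2; observing `odd` rules it out and gives minus infinity. When A itself is observed, the value reduces to -log p(A), in agreement with (1.2.1).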


Formula (1.4.1) also has another interpretation (which can be
reduced to the former under a Bayesian model). Note that it is in the
form of a log likelihood ratio, and hence we can interpret it as the
information in favour of q against p, provided by the observation A. The
connection between these two interpretations is formal more than conceptual,
but we will see a closer connection in the sequel. We will call the second
interpretation discrimination information [28,29].


2. Sub-algebras and Conditioning. 


So far we have been concerning ourselves with individual events. 


Let us now extend our horizons. 




2.1 If $\mathcal{B}$ is a family of events in $\mathcal{A}$ such that
$B_i \wedge B_j = 0$ if $i \neq j$ and $\bigvee_{B \in \mathcal{B}} B = I$
and such that $0 \notin \mathcal{B}$, we will call $\mathcal{B}$ a partition
in $\mathcal{A}$. It follows that $\mathcal{B}$ is at most countable. If
$\mathcal{B}$ is finite, we will speak about a finite partition.


Every partition generates a sub-algebra consisting of the join of 
its atoms, and conversely, every atomic sub-algebra defines a partition. 
We will often use the terms partition and atomic sub-algebra ambiguously. 


Of course, every finite sub-algebra is automatically atomic. 
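
The correspondence between finite partitions and finite sub-algebras can be made concrete in a few lines. In this sketch events are modelled as sets of sample points, which is an assumption of the example (the text works with the algebra directly), and the function names are hypothetical.

```python
from itertools import combinations


def generated_algebra(partition):
    """All joins (unions) of blocks of a finite partition, the empty join included."""
    blocks = [frozenset(b) for b in partition]
    return {frozenset().union(*chosen)
            for r in range(len(blocks) + 1)
            for chosen in combinations(blocks, r)}


def atoms(algebra):
    """Non-zero elements with no strictly smaller non-zero element below them."""
    nonzero = [a for a in algebra if a]
    return {a for a in nonzero if not any(b < a for b in nonzero)}


blocks = [{1, 2}, {3}, {4, 5, 6}]
algebra = generated_algebra(blocks)
```

A partition with three blocks generates an algebra of eight events, and recovering the atoms of that algebra returns the original blocks.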


We can intuitively think of a partition as representing an
experiment to determine which of the atomic propositions $B$ in
$\mathcal{B}$ is true.


2.2 We can define a (simple) random variable on a finite probability
algebra as a function on the set of its atoms. We can also define
elementary random variables as functions on the atoms of a countable partition.
Note that the Boolean algebra generated by a countably infinite partition
is, except in trivial cases, uncountable. We can define the usual
function space operations (sum, product, scalar multiplication, positive,
negative part) on the set of all elementary random variables, where if $f$
and $g$ are defined on $\mathcal{B}$ and $\mathcal{C}$ respectively, then
$f \,\omega\, g$ is defined on $\mathcal{B} \vee \mathcal{C}$ for any
operation $\omega$.
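
The rule that an operation on two elementary random variables yields a variable on the join of their partitions can be sketched as follows. Representing a random variable as a dictionary from blocks to values is an assumption made for this illustration, as are the names `combine`, `f` and `g`.

```python
import operator


def combine(op, f, g):
    """Apply a binary operation to two elementary random variables.

    Each variable is a dict mapping the blocks (frozensets) of its own
    partition to a value; the result lives on the common refinement,
    whose blocks are the non-empty meets of a block of f with a block of g.
    """
    return {b & c: op(f[b], g[c]) for b in f for c in g if b & c}


# f is defined on the partition {1,2} | {3,4}, g on {1,3} | {2,4}.
f = {frozenset({1, 2}): 1, frozenset({3, 4}): 2}
g = {frozenset({1, 3}): 10, frozenset({2, 4}): 20}
total = combine(operator.add, f, g)
```

In this example the common refinement consists of the four singletons, and `total` assigns to each the sum of the two values that hold there.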


With these definitions, the set $E(\mathcal{A})$ of all elementary random
variables on $\mathcal{A}$ becomes a vector lattice, and we obtain the set of all
random variables $V(\mathcal{A})$ as the order-completion of $E(\mathcal{A})$.
It is well known that a non-negative measurable function (in the classical
sense) is




the monotone limit of simple functions [37, p. 224], so that Kappos' 


approach is equivalent to the standard analytical approach. 


Again, we can define expectation in the obvious way on $E(\mathcal{A})$
and extend it to $L(\mathcal{A}) \subset V(\mathcal{A})$ by continuity,
following the Daniell approach [26, chapter V].


2.3 A Boolean algebra is, of course, a Boolean ring. If $\mathcal{B}$ is a
sub-ring of $\mathcal{A}$, it may be a Boolean algebra per se, although it is
not a sub-algebra of $\mathcal{A}$. This is true in particular if
$\mathcal{B}$ is finite, or if $\mathcal{B}$ is a principal ideal. Principal
ideals of a Boolean algebra are of the form
$\mathcal{B} = \{A \in \mathcal{A} : A \leq B\}$ for some $B$. They will be
denoted by $B\mathcal{A}$.


We will denote by $F(\mathcal{A})$ and $R(\mathcal{A})$ respectively the sets
of all sub-$\sigma$-algebras and sub-$\sigma$-rings which are algebras. They
are both complete lattices [2, p. 49] with greatest element $\mathcal{A}$, and
least elements respectively $\mathcal{I} = \{0, I\}$ and $\mathcal{N} = \{0\}$.


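If $\mathcal{C}$ is any class of Boolean $\sigma$-algebras which is a lattice
under the operations

$$\mathcal{A} \wedge \mathcal{B} = \mathcal{A} \cap \mathcal{B}$$

$$\mathcal{A} \vee \mathcal{B} = \text{smallest } \sigma\text{-algebra containing } \mathcal{A} \text{ and } \mathcal{B}$$

we will call it a lattice of $\sigma$-algebras. It may not have a greatest, or a
least element.

We will say that $\mathcal{C}$ is a hereditary class if
$\mathcal{A} \in \mathcal{C} \Rightarrow R(\mathcal{A}) \subset \mathcal{C}$.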




We will require the following condition for any Boolean
$\sigma$-algebra $\mathcal{A}$ we are studying:

There exists an increasing sequence of finite sub-algebras $\mathcal{B}_n$
such that

$$\bigvee_n \mathcal{B}_n = \mathcal{A}. \tag{2.3.1}$$

Such algebras will be called separable. It follows that for any probability
measure $p$ on $\mathcal{A}$, $\mathcal{A}$ is also separable in the metric
$\rho(A,B) = p(A+B)$ [20].


2.4 If $0 \neq B \in \mathcal{A}$, then the principal ideal
$B\mathcal{A} = \{A \in \mathcal{A} : A \leq B\}$ is also a Boolean algebra,
and a probability $p$ on $\mathcal{A}$ induces a probability $p_B$ on
$B\mathcal{A}$ defined by

$$p_B(C) = \frac{p(C)}{p(B)}, \quad C \leq B. \tag{2.4.1}$$

We can extend this measure as an (improper) probability on all of
$\mathcal{A}$ by

$$p_B(A) = \frac{p(A \wedge B)}{p(B)}, \tag{2.4.2}$$

the so-called conditional probability given $B$. We can make it a proper
probability by considering $\mathcal{A}$ modulo $B^c\mathcal{A}$, which is
isomorphic to $B\mathcal{A}$.

In this way, any finite partition $\mathcal{B}$ generates a family of
probability algebras, one for each atom of the partition.




If $f$ is any function defined on a hereditary class of probability
algebras, then given an algebra $\mathcal{A}$ and a finite partition
$\mathcal{B}$ in $\mathcal{A}$ we have a family of values

$$f[\mathcal{A} \mid B] = f(B\mathcal{A}) \tag{2.4.3}$$

indexed by the atoms $B$ of $\mathcal{B}$. The above expression in fact defines a
simple ($\mathcal{B}$-measurable) random variable, which we can denote by
$f[\mathcal{A} \mid \cdot\,]$, and we will denote the expectation by

$$f(\mathcal{A} \mid \mathcal{B}) = \sum_B p(B) \cdot f(B\mathcal{A}), \tag{2.4.4}$$

the summation extending over all atoms $B$ of $\mathcal{B}$.
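
The expectation (2.4.4) can be sketched for finite algebras as follows. The choice of f as the Shannon entropy of the atom probabilities is an assumption made purely for illustration (any function of a finite probability algebra could be plugged in), as are the function names and the representation of algebras by lists of atoms over a set of sample points.

```python
import math
from fractions import Fraction


def entropy(probs):
    """Shannon entropy (nats) of a finite list of atom probabilities; used here as f."""
    return -sum(float(q) * math.log(q) for q in probs if q > 0)


def conditional(f, p, part_a, part_b):
    """f(A|B) = sum over atoms B of p(B) * f(BA), as in (2.4.4).

    p maps sample points to probabilities; part_a and part_b are lists of
    frozensets, the atoms of the finite algebras A and B.
    """
    total = 0.0
    for b in part_b:
        p_b = sum(p[x] for x in b)
        # Atoms of BA are the non-zero meets of B with atoms of A, weighted by p_B.
        atom_probs = [sum(p[x] for x in a & b) / p_b for a in part_a if a & b]
        total += float(p_b) * f(atom_probs)
    return total


p = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
A = [frozenset({1, 3}), frozenset({2, 4})]
B = [frozenset({1, 2}), frozenset({3, 4})]
h_a_given_b = conditional(entropy, p, A, B)
```

With this choice of f, the value `h_a_given_b` equals the entropy of the joint partition minus the entropy of `B`, and conditioning an algebra on itself gives zero.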


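If $\mathcal{A}$ and $\mathcal{B}$ are sub-algebras of an algebra
$\mathcal{C}$, but we don't have $\mathcal{B} \subset \mathcal{A}$, we can
still define $f[\mathcal{A} \mid B]$ via (2.4.3) and hence
$f(\mathcal{A} \mid \mathcal{B})$; where by $B\mathcal{A}$ we still mean all
events of the form $B \wedge A$, $A \in \mathcal{A}$, although
it is no longer an ideal of $\mathcal{A}$. It is easy to show that

$$f(\mathcal{A} \mid \mathcal{B}) = f(\mathcal{A} \vee \mathcal{B} \mid \mathcal{B}) \tag{2.4.5}$$

and that

$$f(\mathcal{A} \mid \mathcal{A}) = f(\mathcal{I}).$$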


If $\mathcal{D}$ is any separable sub-algebra of $\mathcal{C}$, and if we have
finite $\mathcal{B}_n \uparrow \mathcal{D}$, then we have a sequence of random
variables

$$f[\mathcal{A} \mid \mathcal{B}_n] \tag{2.4.6}$$




and if this sequence has a unique limit (in some sense) then we can talk of the random variable

\[ f_{\mathcal{D}}[\mathcal{A} \mid *] \tag{2.4.7} \]

and its expectation $f(\mathcal{A} \mid \mathcal{D})$.

We will adopt the following terminology: the random variable will be called $f$ conditioned by $\mathcal{D}$, its expectation will be called $f$ conditional on $\mathcal{D}$.


2.5 Conditional probability with respect to a non-atomic sub-algebra can be defined as a Radon-Nikodym derivative.

For any $A$, $p_A(B) = p(A \wedge B)$ is a measure on $\mathcal{B}$, absolutely continuous with respect to $p$ restricted to $\mathcal{B}$, hence the Radon-Nikodym derivative $p(A \mid \mathcal{B})$ exists [26, p. 144].

Recall that we have only one impossible event, so that essentially we are identifying functions equal almost everywhere. Hence the derivative is unique.


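Furthermore, if $A = A_1 \vee A_2$, $0 = A_1 \wedge A_2$, then

\[
\begin{aligned}
E\bigl(I_B\,(p(A_1 \mid \mathcal{B}) + p(A_2 \mid \mathcal{B}))\bigr) &= p(A_1 \wedge B) + p(A_2 \wedge B) \\
&= p((A_1 \vee A_2) \wedge B) \\
&= p(A \wedge B) \qquad \forall\, B \in \mathcal{B}.
\end{aligned}
\]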




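Hence, because of uniqueness,

\[ p(A_1 \mid \mathcal{B}) + p(A_2 \mid \mathcal{B}) = p(A \mid \mathcal{B}). \tag{2.5.1} \]

Also, if $A_n \uparrow A$ then $p(A_n \mid \mathcal{B})$ is increasing, and by monotone convergence, for any $B \in \mathcal{B}$

\[
\begin{aligned}
E\bigl(I_B \sup_n p(A_n \mid \mathcal{B})\bigr) &= \sup_n E\bigl(I_B\, p(A_n \mid \mathcal{B})\bigr) \\
&= \sup_n p(A_n \wedge B) \\
&= p(A \wedge B).
\end{aligned}
\tag{2.5.2}
\]

Hence, again because of uniqueness,

\[ \sup_n p(A_n \mid \mathcal{B}) = p(A \mid \mathcal{B}). \tag{2.5.3} \]

We can thus consider conditional probability as a continuous vector-valued measure.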


Conditional expectation could be defined as expectation with 
respect to conditional probability (see [21] for integration with respect 


to vector measures), or directly in terms of Radon-Nikodym derivatives 


(2oecps 225) 


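2.6 Two events $A$ and $B$ are said to be independent if $p(A \wedge B) = p(A)\, p(B)$. Two algebras $\mathcal{A}$ and $\mathcal{B}$ are independent if for all $A \in \mathcal{A}$, $B \in \mathcal{B}$, $A$ and $B$ are independent. It follows that $\mathcal{A} \wedge \mathcal{B} = \mathcal{T}$.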




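Two algebras $\mathcal{A}$ and $\mathcal{B}$ are conditionally independent given a third algebra $\mathcal{C}$, if for all $A \in \mathcal{A}$, $B \in \mathcal{B}$,

\[ p(A \wedge B \mid \mathcal{C}) = p(A \mid \mathcal{C})\, p(B \mid \mathcal{C}). \tag{2.6.1} \]

If $\mathcal{A}$ and $\mathcal{B}$ are independent, and finite, then the $p_B$ of (2.4.2) are identical to $p$ for all $B$. It hence follows that for any function $f$ of section 2.4,

\[ f(\mathcal{A} \mid \mathcal{B}) = f(\mathcal{A}). \tag{2.6.2} \]

If $\mathcal{C}$ is atomic, and $\mathcal{A}$ and $\mathcal{B}$ are conditionally independent given $\mathcal{C}$, then

\[ f(\mathcal{A} \mid \mathcal{B} \vee \mathcal{C}) = f(\mathcal{A} \mid \mathcal{C}) \tag{2.6.3} \]

for the atoms of $\mathcal{B} \vee \mathcal{C}$ are of the form $B \wedge C$ and $p_{B \wedge C} = p_C$.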


3. Entropy and Discrimination Information. 


3.1 Let us suppose that we are in a situation where we are independently receiving information from a large number of algebras isomorphic to $\mathcal{E}$. Equivalently, we could imagine that after having an event in $\mathcal{E}$ confirmed, the situation changes, and our uncertainty is restored. It is then reasonable to ask about a measure of average or expected information. If $\mathcal{F}$ is a sub-algebra generated by a finite partition, then the expected information in $\mathcal{F}$ is naturally defined as




\[ H(\mathcal{F}) = -\sum_{A} p(A) \log p(A), \tag{3.1.1} \]

the summation extending over all atoms $A$ of $\mathcal{F}$. This definition also makes sense in the case that $\mathcal{F}$ is generated by a countable partition, but in this case $H(\mathcal{F})$ may be infinite. The quantity $H$ is usually called entropy.
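A minimal sketch of (3.1.1), assuming a finite algebra is given simply as the list of probabilities of its atoms. The function name `entropy` and the choice of logarithm base as a parameter are illustrative assumptions.

```python
import math

def entropy(probs, base=math.e):
    """H = -sum p(A) log p(A), summed over the atoms A with p(A) > 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

h_uniform = entropy([0.25] * 4, base=2)   # four equally likely atoms, in bits
h_trivial = entropy([1.0])                # trivial algebra: a single atom
```

With four equally likely atoms the entropy in base 2 is two units, and the trivial algebra carries no expected information.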


Definition (3.1.1) can be derived axiomatically, either presup- 


posing a probability on E , or defining the probability in terms of H. 


We can define entropy axiomatically in terms of a probability. 
Such axioms were first presented by Shannon [39]. Various versions and 
simplifications have been summarized by Aczel [1]. Since the entropy is a 
function of the probability distribution, the axioms are often expressed in 
terms of n-tuples of positive numbers summing to unity. We will here 


rephrase them in terms of Boolean algebras. 


Let the entropy $H$ be a function from the category of all finite probability algebras to the real numbers. For any finite algebra $\mathcal{A}$, and a sub-algebra $\mathcal{B}$, we can define the entropy of $\mathcal{A}$ conditional on $\mathcal{B}$ according to (2.4.4).

It is reasonable to want

\[ H(\mathcal{A}) = H(\mathcal{B}) + H(\mathcal{A} \mid \mathcal{B}) \quad \text{for each } \mathcal{B} \le \mathcal{A}. \tag{3.1.2} \]

It turns out that formula (3.1.2), along with the requirement that the entropy of a diatomic algebra be a Lebesgue measurable function of the probability, is sufficient to characterize (3.1.1) up to a scale constant [30].




In fact, it is sufficient to require that (3.1.2) be true only for sub-algebras $\mathcal{B}$, one of whose atoms is equal to the join of two atoms of $\mathcal{A}$, the other atoms being identical to those of $\mathcal{A}$.
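The requirement (3.1.2) can be checked numerically in the restricted form just described, where the sub-algebra joins two atoms and leaves the rest alone. The probabilities below are hypothetical, and the entropy is taken in the form (3.1.1).

```python
import math

def entropy(probs):
    """Entropy -sum p log p of a finite distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Atoms of a hypothetical three-atom algebra A.
p = [0.5, 0.2, 0.3]

# Sub-algebra B: the first atom is kept, the last two atoms are joined.
joined = p[1] + p[2]
p_B = [p[0], joined]

# H(A|B) = sum over atoms B of p(B) * H(BA), as in (2.4.4).
# The unmerged atom contributes H of a single-atom algebra, which is zero.
h_A_given_B = (p[0] * entropy([1.0])
               + joined * entropy([p[1] / joined, p[2] / joined]))

lhs = entropy(p)                     # H(A)
rhs = entropy(p_B) + h_A_given_B     # H(B) + H(A|B)
```

Up to floating-point rounding, `lhs` and `rhs` agree, which is the content of (3.1.2) for this particular sub-algebra.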


3.2 We can also characterize (3.1.1) using a limited definition of probabilities [23].


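Let $\mathfrak{F}$ be the category of all finite Boolean algebras. $\mathfrak{F}$ is a hereditary lattice of algebras according to section 2.3.

For any $\mathcal{A} \in \mathfrak{F}$ let $S(\mathcal{A})$ represent the set of all sub-rings of $\mathcal{A}$. If $\phi$ is any homomorphism from $\mathcal{A}$ into $\mathcal{B}$, it induces a map $\phi_S$ from $S(\mathcal{A})$ into $S(\mathcal{B})$. A sub-class $\mathfrak{H} \subset \mathfrak{F}$ is designated as a class containing homogeneous algebras. Within this sub-class, isomorphic algebras are identified.

For any $\mathcal{H} \in \mathfrak{H}$, we define a probability by

\[ p_{\mathcal{H}}(A) = \frac{N(A)}{N(\mathcal{H})} \tag{3.2.1} \]

where $N(A)$ represents the number of atoms of $A$. Let $\mathfrak{G} = \bigcup_{\mathcal{H} \in \mathfrak{H}} S(\mathcal{H})$. We will first define $H$ only on $\mathfrak{G}$.

We require the properties that

(a) If $\mathcal{H} \in \mathfrak{H}$ and $\phi$ is an automorphism of $\mathcal{H}$ then

\[ H(\phi_S(\mathcal{B})) = H(\mathcal{B}) \quad \text{for all } \mathcal{B} \in S(\mathcal{H}). \tag{3.2.2} \]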




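(b) $H$ is isotone on $\mathfrak{H}$, i.e., if $\mathcal{G}, \mathcal{H} \in \mathfrak{H}$ and $\mathcal{G} \le \mathcal{H}$ then

\[ H(\mathcal{G}) \le H(\mathcal{H}). \tag{3.2.3} \]

(c) If $\mathcal{H} \in \mathfrak{H}$ and $\mathcal{G}$ is a sub-algebra of $\mathcal{H}$ (not necessarily in $\mathfrak{H}$) then

\[ H(\mathcal{H}) = H(\mathcal{G}) + H(\mathcal{H} \mid \mathcal{G}). \tag{3.2.4} \]

$H$ can be used to define a metric. If $\mathcal{A}$ and $\mathcal{B}$ are isomorphic, let

\[ \delta(\mathcal{A}, \mathcal{B}) = \inf_{\phi} \sup_{\mathcal{C} \in S(\mathcal{A})} \bigl| H(\mathcal{C}) - H(\phi_S(\mathcal{C})) \bigr| \tag{3.2.5} \]

where $\phi$ ranges over all isomorphisms from $\mathcal{A}$ to $\mathcal{B}$. Now, if $\mathcal{A}, \mathcal{B}$ are not isomorphic, let $\rho(\mathcal{A}, \mathcal{B}) = 1$, otherwise let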


(A,B) -3en (3.2.6) 


Then $\rho$ is a metric on $\mathfrak{G}$, and under this metric, $H$ is continuous. If we form the completion of $\mathfrak{G}$ with respect to $\rho$, and extend $H$ by continuity, it can then be shown that we can define a probability $p_{\mathcal{A}}$ on each $\mathcal{A} \in \mathfrak{F}$, and that

\[ H(\mathcal{A}) = -\sum_{A} p_{\mathcal{A}}(A) \log p_{\mathcal{A}}(A), \tag{3.2.7} \]

the summation extending over all atoms $A$ of $\mathcal{A}$. Furthermore, these probabilities are consistent, in the sense that if $\mathcal{B}$ is a sub-ring of $\mathcal{A}$, then




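\[ p_{\mathcal{B}} = \frac{p_{\mathcal{A}} \mid \mathcal{B}}{p_{\mathcal{A}}(I_{\mathcal{B}})} \tag{3.2.8} \]

where $I_{\mathcal{B}}$ is the maximal element of $\mathcal{B}$ and by $p_{\mathcal{A}} \mid \mathcal{B}$ we mean $p_{\mathcal{A}}$ restricted to $\mathcal{B}$.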


3.3 We can also talk about average information in the generalized sense of (1.4.1).


We can consider two finite sub-algebras $\mathcal{A}$ and $\mathcal{B}$, the first consisting of events of interest, but not directly observable, the second consisting of observable events. We want to express the average information in $\mathcal{A}$, transmitted via $\mathcal{B}$.


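If we observe $B \in \mathcal{B}$, the gain in information is:

\[ J_B(A) = \log \frac{p(A \mid B)}{p(A)}, \tag{3.3.1} \]

so that the average information in favour of $A$ when $A$ obtains is:

\[ J(A) = \sum_{B} p(B \mid A) \log \frac{p(A \mid B)}{p(A)} \tag{3.3.2} \]

summing over all atoms $B$ of $\mathcal{B}$. The average information in $\mathcal{A}$ via $\mathcal{B}$ is hence:

\[ R(\mathcal{A}, \mathcal{B}) = \sum_{A} p(A) \sum_{B} p(B \mid A) \log \frac{p(A \mid B)}{p(A)} \tag{3.3.3} \]

summing over all atoms $A$ of $\mathcal{A}$ and $B$ of $\mathcal{B}$.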




A typical application is communication theory, where $\mathcal{A}$ represents transmitted symbols, and $\mathcal{B}$ represents received symbols. We know the probability distribution $\mu$ of the transmitted symbols, and we know the noise characteristics $\nu$ of the channel. $\nu$ is expressed as conditional probabilities $\nu(B \mid A)$ of reception given transmission. We then have $p(A) = \mu(A)$,

\[ p(B) = \sum_{A} \nu(B \mid A)\, \mu(A) \]

summing over all atoms $A$ of $\mathcal{A}$, and $p(A \mid B)$ can be determined from Bayes' rule. Expression (3.3.3) can then be rearranged as:

\[
\begin{aligned}
R(\mathcal{A}, \mathcal{B}) &= -\sum_{A} \mu(A) \log \mu(A) + \sum_{B} p(B) \sum_{A} p(A \mid B) \log p(A \mid B) \\
&= H(\mathcal{A}) - H(\mathcal{A} \mid \mathcal{B})
\end{aligned}
\tag{3.3.4}
\]

the summations extending over all atoms $A$ of $\mathcal{A}$ and $B$ of $\mathcal{B}$, and we see that it is the difference of two entropies. $H(\mathcal{A} \mid \mathcal{B})$ is called the "equivocation of the channel", and $R(\mathcal{A}, \mathcal{B})$ is the rate of transmission.
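The rearrangement from (3.3.3) to (3.3.4) can be checked on a small example. The sketch below assumes a hypothetical binary channel; the input distribution `mu` and the noise table `nu` are made-up numbers chosen only for illustration.

```python
import math

mu = [0.7, 0.3]            # hypothetical input distribution mu(A)
nu = [[0.9, 0.1],          # nu[a][b] = nu(B|A): reception given transmission
      [0.2, 0.8]]

# Joint probabilities p(A ^ B), output probabilities p(B), posteriors p(A|B).
joint = [[mu[a] * nu[a][b] for b in range(2)] for a in range(2)]
p_B = [sum(joint[a][b] for a in range(2)) for b in range(2)]
post = [[joint[a][b] / p_B[b] for b in range(2)] for a in range(2)]  # Bayes' rule

# (3.3.3): R = sum_A p(A) sum_B p(B|A) log( p(A|B) / p(A) )
rate = sum(mu[a] * nu[a][b] * math.log(post[a][b] / mu[a])
           for a in range(2) for b in range(2))

# (3.3.4): R = H(A) - H(A|B), the entropy minus the equivocation.
h_A = -sum(m * math.log(m) for m in mu)
h_A_given_B = -sum(p_B[b] * post[a][b] * math.log(post[a][b])
                   for a in range(2) for b in range(2))
```

Both routes give the same rate of transmission up to rounding, and the rate is positive because this channel is not pure noise.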


$R$ has also been used as a measure of the information provided by an experiment in a Bayesian framework [31].


3.4 Similarly we can talk about the average discrimination information, as

\[ I(p, q; \mathcal{A}) = \sum_{A} p(A) \log \frac{p(A)}{q(A)} \tag{3.4.1} \]




the summation extending over all atoms A of A. I can be thought of 
as the average information in A in favour of p against q , when the 


actual probability is p. 


It is easy to show that "rate of transmission" (3.3.3), (3.3.4) can be expressed as a discrimination. In fact

\[ R(\mathcal{A}, \mathcal{B}) = I(p, p^{*}; \mathcal{A} \vee \mathcal{B}) \tag{3.4.2} \]

where $p^{*}$ is the probability defined by taking $\mathcal{A}$ and $\mathcal{B}$ to be stochastically independent.


It may be possible that events possible under $q$ are impossible under $p$, so that $q$ may be defective considered as a probability on $\mathcal{A}$. If events are impossible under $q$ but possible under $p$, then total knowledge can be gained, and we say that $I[p, q] = \infty$.


When the probabilities p and q are understood, we will write 
simply I(A) . When the algebra is understood, and we wish to indicate the 


probabilities, we will use square brackets I[p,q] 
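The following sketch puts (3.4.1), the convention $I[p, q] = \infty$, and the identity (3.4.2) side by side. The joint table is hypothetical, and the comparison with (3.3.4) uses the standard rewriting $H(\mathcal{A}) - H(\mathcal{A} \mid \mathcal{B}) = H(\mathcal{A}) + H(\mathcal{B}) - H(\mathcal{A} \vee \mathcal{B})$.

```python
import math

def discrimination(p, q):
    """I(p, q) = sum_A p(A) log(p(A)/q(A)); infinite if some q(A) = 0 < p(A)."""
    total = 0.0
    for pa, qa in zip(p, q):
        if pa == 0:
            continue            # atoms impossible under p contribute nothing
        if qa == 0:
            return math.inf     # possible under p but impossible under q
        total += pa * math.log(pa / qa)
    return total

def entropy(p):
    """Entropy -sum p log p of a finite distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Hypothetical joint probabilities on the atoms A_i ^ B_j of A v B.
joint = [[0.63, 0.07], [0.06, 0.24]]
p_A = [sum(row) for row in joint]
p_B = [sum(col) for col in zip(*joint)]

flat_p = [joint[i][j] for i in range(2) for j in range(2)]
# p*: same marginals, but A and B taken to be independent.
flat_star = [p_A[i] * p_B[j] for i in range(2) for j in range(2)]

rate_as_discrimination = discrimination(flat_p, flat_star)
rate_as_entropy_gap = entropy(p_A) + entropy(p_B) - entropy(flat_p)
```

The two rates coincide, while `discrimination([0.5, 0.5], [1.0, 0.0])` returns infinity, matching the convention adopted for events impossible under $q$.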


3.5 So far we have confined ourselves to information-theoretic concepts on finite algebras. Let us now attempt to extend them to the infinite case. We will restrict ourselves to algebras $\mathcal{A}$ which are separable.

Let us first consider the concept of conditional entropy $H(\mathcal{A} \mid \mathcal{B})$, where for now we require that $\mathcal{A}$ and $\mathcal{B}$ are finite sub-algebras of a probability algebra $\mathcal{E}$. This concept includes unconditional entropy since $H(\mathcal{A}) = H(\mathcal{A} \mid \mathcal{T})$, $\mathcal{T}$ being the trivial sub-algebra.




We have already noticed that $H$ is increasing in its first argument. We will now show that it is decreasing in its second.


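First, observe that

\[ g(t) = t \log t \]

is a convex function and hence:

\[ \sum_{B} p(B)\, g(p(A \mid B)) \ge g\Bigl(\sum_{B} p(B)\, p(A \mid B)\Bigr) = g[p(A)], \tag{3.5.1} \]

the summations extending over all atoms $B$ of $\mathcal{B}$. Therefore,

\[
\begin{aligned}
H(\mathcal{A} \mid \mathcal{B}) &= -\sum_{A} \sum_{B} p(B)\, g(p(A \mid B)) \\
&\le -\sum_{A} g(p(A)) \\
&= H(\mathcal{A}),
\end{aligned}
\tag{3.5.2}
\]

the summations extending over all atoms $A$ of $\mathcal{A}$ and $B$ of $\mathcal{B}$.

Further, we note that if $\mathcal{C} \le \mathcal{B}$ then

\[ H(\mathcal{A} \mid \mathcal{C}) = \sum_{C} p(C)\, H(C\mathcal{A}) \]

and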




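\[
\begin{aligned}
H(\mathcal{A} \mid \mathcal{B}) &= \sum_{B} p(B)\, H(B\mathcal{A}) \\
&= \sum_{C} p(C)\, H(C\mathcal{A} \mid C\mathcal{B}),
\end{aligned}
\tag{3.5.3}
\]

the summations extending over all atoms $B$ of $\mathcal{B}$ and $C$ of $\mathcal{C}$. Hence, applying (3.5.2) to each term,

\[ H(\mathcal{A} \mid \mathcal{C}) \ge H(\mathcal{A} \mid \mathcal{B}). \tag{3.5.4} \]

We can thus define, for any algebra $\mathcal{B}$,

\[ H(\mathcal{A} \mid \mathcal{B}) = \inf_{\mathcal{C} \le \mathcal{B}} H(\mathcal{A} \mid \mathcal{C}), \quad \text{where } \mathcal{C} \text{ is finite.} \tag{3.5.5} \]

Also, we can define for any algebra $\mathcal{A}$

\[ H(\mathcal{A} \mid \mathcal{B}) = \sup_{\mathcal{C} \le \mathcal{A}} H(\mathcal{C} \mid \mathcal{B}). \tag{3.5.6} \]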
C<A 


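3.6 We now turn our attention to discrimination information. It can readily be proved, again from the convexity of $t \log t$, that $I(\mathcal{A} \mid \mathcal{B})$ is increasing in its first argument.

Hence we can define, for any algebra $\mathcal{A}$,

\[ I(\mathcal{A} \mid \mathcal{B}) = \sup_{\mathcal{C} \le \mathcal{A}} I(\mathcal{C} \mid \mathcal{B}). \tag{3.6.1} \]

However, $I$ is not monotone in its second argument. It is true if $\mathcal{C} \le \mathcal{B} \le \mathcal{A}$ that $I(\mathcal{A} \mid \mathcal{B}) \le I(\mathcal{A} \mid \mathcal{C})$, but if $\mathcal{A}$ and $\mathcal{B}$ are incomparable it may be false. This is not surprising if one recalls that

\[ I(\mathcal{A} \mid \mathcal{B}) = I(\mathcal{A} \vee \mathcal{B} \mid \mathcal{B}). \]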




A relatively simple counter-example can be found:

Let the atoms of $\mathcal{A}$ and $\mathcal{B}$ be respectively $A_1, A_2$ and $B_1, B_2$. Let $p$ and $q$ be defined by the following tables, on $\mathcal{A} \vee \mathcal{B}$.


P q 
Ay eq 
Bo Bo 
a a 
ap, BS 


Then we find that $I(\mathcal{A}) = I(\mathcal{A} \mid \mathcal{T}) = 0$ but

\[ I(\mathcal{A} \mid \mathcal{B}) = \tfrac{5}{4} \log 2 - \tfrac{3}{4} \log 3 = .0425. \tag{3.6.2} \]


It is quite in keeping with intuition that ancillary information from a 


previous experiment could improve the informativeness of a present experi- 


ment. 
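The same phenomenon can be reproduced with a small computation. The tables below are hypothetical and chosen for simplicity; they are not the tables of the counter-example above, so the resulting value differs from (3.6.2). The conditional quantity is computed as the expectation (2.4.4) of the discrimination information within each atom of the conditioning algebra.

```python
import math

def discrimination(p, q):
    """I(p, q) = sum p log(p/q) over atoms with p > 0 (q assumed positive)."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

# Hypothetical tables on the atoms A_i ^ B_j (rows A_1, A_2; columns B_1, B_2).
p = [[3/8, 1/8], [1/8, 3/8]]
q = [[1/4, 1/4], [1/4, 1/4]]

def marginal_A(table):
    """Probabilities of the atoms of A, obtained by summing over B."""
    return [sum(row) for row in table]

def conditional_I(p, q):
    """I(A|B) = sum_B p(B) * I(p_B, q_B; BA), following (2.4.4)."""
    total = 0.0
    for j in range(len(p[0])):
        pB = sum(row[j] for row in p)
        qB = sum(row[j] for row in q)
        total += pB * discrimination([row[j] / pB for row in p],
                                     [row[j] / qB for row in q])
    return total

i_A = discrimination(marginal_A(p), marginal_A(q))   # I(A) = I(A|T)
i_A_given_B = conditional_I(p, q)                    # I(A|B)
```

Here the two probabilities agree on the first algebra alone, so `i_A` vanishes, yet they differ once the second algebra is known, so `i_A_given_B` is strictly positive.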


However, in spite of lack of monotonicity, if $\mathcal{B}_n$ is an increasing sequence of finite sub-algebras such that $\bigvee_n \mathcal{B}_n = \mathcal{B}$, then the sequence of random variables $I_{\mathcal{B}_n}[\mathcal{A} \mid *]$ does converge, as does the




sequence of their expectations, hence we can unambiguously talk about $I(\mathcal{A} \mid \mathcal{B})$ and $I_{\mathcal{B}}[\mathcal{A} \mid *]$ whenever $\mathcal{B}$ is separable (see [18], proof of theorem 2.3).


The limit can in fact be represented as the discrimination infor- 


mation for the conditional probabilities of section 2.5. 


4. Properties of Entropy and Discrimination Information. 


We will here list some of the more significant aspects of these 


two quantities. 


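4.1 (a) Entropy is non-negative and isotone. It is zero only for the trivial algebra.

\[ \mathcal{A} \le \mathcal{B} \implies H(\mathcal{A}) \le H(\mathcal{B}) \tag{4.1.1} \]

\[ H(\mathcal{A}) \ge 0, \qquad H(\mathcal{A}) = 0 \text{ iff } \mathcal{A} = \mathcal{T}. \tag{4.1.2} \]

(b) Entropy is conditionally additive.

\[ H(\mathcal{A} \vee \mathcal{B}) = H(\mathcal{B}) + H(\mathcal{A} \mid \mathcal{B}). \tag{4.1.3} \]

This can be derived for the finite case from (3.1.2), substituting $\mathcal{A} \vee \mathcal{B}$ for $\mathcal{A}$, and using (2.4.5). The property follows in the separable case by taking monotone limits. It follows from (2.6.2) that: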




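(c) Entropy is additive for independent subfields. $\mathcal{A}$ and $\mathcal{B}$ are independent

\[ \implies H(\mathcal{A} \vee \mathcal{B}) = H(\mathcal{A}) + H(\mathcal{B}). \tag{4.1.4} \]

(d) Conditional entropy is conditionally additive.

\[ H(\mathcal{A} \vee \mathcal{B} \mid \mathcal{C}) = H(\mathcal{B} \mid \mathcal{C}) + H(\mathcal{A} \mid \mathcal{B} \vee \mathcal{C}). \tag{4.1.5} \]

For atomic algebras the result follows from (4.1.3) for each atom of $\mathcal{C}$. It follows in general by monotone limits.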


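(e) Both conditioned and conditional entropy are continuous in both arguments. If $\mathcal{B}_n \uparrow \mathcal{B}$ and $\mathcal{C}_n \downarrow \mathcal{C}$ then

\[
\begin{gathered}
H_{\mathcal{B}_n}[\mathcal{A} \mid *] \to H_{\mathcal{B}}[\mathcal{A} \mid *], \qquad H_{\mathcal{C}_n}[\mathcal{A} \mid *] \to H_{\mathcal{C}}[\mathcal{A} \mid *], \\
H(\mathcal{A} \mid \mathcal{B}_n) \to H(\mathcal{A} \mid \mathcal{B}), \qquad H(\mathcal{A} \mid \mathcal{C}_n) \to H(\mathcal{A} \mid \mathcal{C}), \\
H(\mathcal{B}_n \mid \mathcal{A}) \to H(\mathcal{B} \mid \mathcal{A}), \qquad H(\mathcal{C}_n \mid \mathcal{A}) \to H(\mathcal{C} \mid \mathcal{A}).
\end{gathered}
\tag{4.1.6}
\]

For proofs see [34] where our conditional entropy is called conditional information.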


(f) Conditional entropy is anti-isotone in the conditioning algebra.
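Properties (b), (d) and (f) can be checked numerically on three finite algebras. The sketch assumes a hypothetical joint distribution on the atoms $A_a \wedge B_b \wedge C_c$; the weights and the helper names `marginal` and `cond_entropy` are illustrative assumptions.

```python
import itertools
import math

# Hypothetical joint probabilities on atoms (a, b, c), built from integer weights.
weights = [3, 1, 2, 2, 1, 4, 2, 1]
total = sum(weights)
p = {abc: w / total
     for abc, w in zip(itertools.product(range(2), repeat=3), weights)}

def marginal(keep):
    """Marginal distribution on the coordinates listed in `keep`."""
    out = {}
    for abc, pr in p.items():
        key = tuple(abc[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

def cond_entropy(target, given):
    """H(target | given) = -sum p(t, g) log( p(t, g) / p(g) )."""
    both = sorted(set(target) | set(given))
    joint, cond = marginal(both), marginal(given)
    h = 0.0
    for key, pr in joint.items():
        g = tuple(key[both.index(i)] for i in given)
        h -= pr * math.log(pr / cond[g])
    return h

A, B, C = 0, 1, 2
lhs = cond_entropy([A, B], [C])                               # H(A v B | C)
rhs = cond_entropy([B], [C]) + cond_entropy([A], [B, C])      # (4.1.5)
```

An empty `given` list corresponds to conditioning on the trivial algebra, so `cond_entropy([A], [])` is the unconditional entropy, and refining the conditioning algebra from nothing to `[C]` to `[B, C]` can only lower the result.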




4.2 (a) Discrimination information is non-negative and isotone. It is equal to zero if and only if the two measure algebras are identical:

\[ I(p, q; \mathcal{A}) \ge 0, \qquad I(p, q; \mathcal{A}) = 0 \iff p(A) = q(A) \quad \forall\, A \in \mathcal{A}. \tag{4.2.1} \]

This follows from the convexity of $t \log t$, see [29].

(b) Conditional discrimination information is conditionally additive;

\[ I(\mathcal{A} \vee \mathcal{B}) = I(\mathcal{B}) + I(\mathcal{A} \mid \mathcal{B}). \tag{4.2.2} \]

This has been proved (for separable $\mathcal{B}$) by Ghurye [18]. If $\mathcal{A}$ and $\mathcal{B}$ are independent then

\[ I(\mathcal{A} \vee \mathcal{B}) = I(\mathcal{A}) + I(\mathcal{B}). \tag{4.2.3} \]


(c) Conditional discrimination information is discrimination information, i.e., $\exists\, q^{*}$ such that

\[ I(p, q; \mathcal{A} \mid \mathcal{B}) = I(p, q^{*}; \mathcal{A}) \tag{4.2.4} \]

see [7], equation 3.3.

(d) If $f$ is the Radon-Nikodym derivative of $p$ with respect to $q$, then

\[ I[p, q] = E_p(\log f). \tag{4.2.5} \]




This expression is usually taken as a definition. This result 
has been demonstrated by Kolmogorov, Gelfand and Yaglom [27] and 
Ghurye [18]. 


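(e) Conditional information characterizes sufficiency, i.e., for $\mathcal{B} \le \mathcal{A}$, $I(\mathcal{A} \mid \mathcal{B}) = 0$ iff $\mathcal{B}$ is a sufficient sub-field for the pair $\{p, q\}$.

(f) Discrimination information is continuous. If $\mathcal{A}_n \uparrow \mathcal{A}$ then

\[ I(\mathcal{A}_n) \uparrow I(\mathcal{A}). \tag{4.2.6} \]

Proof follows from the convexity of $t \log t$ and appendix 1.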
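Properties (a) and (b) of this list lend themselves to a quick numerical check. The two tables below are hypothetical joint probabilities under $p$ and $q$ on the atoms $A_i \wedge B_j$; the helper names are assumptions of this sketch.

```python
import math

def discrimination(p, q):
    """I(p, q) = sum p log(p/q) over atoms with p > 0 (q assumed positive)."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def flatten(table):
    """Probabilities of the atoms A_i ^ B_j of A v B, row by row."""
    return [x for row in table for x in row]

def columns(table):
    """Probabilities of the atoms B_j, obtained by summing over A."""
    return [sum(col) for col in zip(*table)]

p = [[0.40, 0.10], [0.20, 0.30]]   # hypothetical p(A_i ^ B_j)
q = [[0.25, 0.25], [0.10, 0.40]]   # hypothetical q(A_i ^ B_j)

i_joint = discrimination(flatten(p), flatten(q))      # I(A v B)
i_B = discrimination(columns(p), columns(q))          # I(B)

# I(A|B): expectation under p of the discrimination within each atom of B.
i_A_given_B = 0.0
for j, (pb, qb) in enumerate(zip(columns(p), columns(q))):
    i_A_given_B += pb * discrimination([row[j] / pb for row in p],
                                       [row[j] / qb for row in q])
```

The joint value splits as in (4.2.2), each term is non-negative as in (4.2.1), and comparing a table with itself gives zero.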


4.3 Discrimination information, in measuring the ease in differentiating one probability distribution from another, in a sense measures a "distance" between probability measures. It is, however, not a metric on the space of probability measures, for it is neither symmetric, nor does it satisfy the triangle inequality.

However, one can define convergence of probability in terms of discrimination information, and this convergence is stronger than convergence in the total variation norm [8]. In general, however, the convergence structure is only a Fréchet V-space [16,17], and not even a topological space.


In classical statistical problems, our set of possible probabilities is parametrized by a convex set in Euclidean space, and thus has not only a metric, but also a differentiable structure. If we let $\theta_0$ represent the parameter for the null hypothesis, then $I[\theta, \theta_0]$ is a measure of




how easy it is to reject the null hypothesis under the alternative $\theta$. The local behaviour of $I$ about $\theta_0$ suggests how "secure" our statistical inferences regarding $\theta_0$ are. Since $I[\theta, \theta_0]$ is increasing away from $\theta_0$, it is easy to see that the total differential, if it exists, must be zero. Hence the curvature would give an idea as to how quickly $I$ moves away from its tangent plane, i.e., how quickly the discrimination information increases. Let $D(\theta_0)$ be the matrix of second derivatives of $I[\theta, \theta_0]$ at $\theta_0$. Then, provided that we assume sufficient regularity conditions, it is easily seen that $D(\theta_0)$ is Fisher's information matrix [29].
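A one-parameter illustration, offered only as a sketch: for a two-atom algebra with probabilities $(\theta, 1 - \theta)$, the discrimination information between the parameter values $\theta$ and $\theta_0$ has vanishing first derivative at $\theta_0$, and its second derivative there is the familiar Fisher information $1/(\theta_0(1 - \theta_0))$. The derivatives are approximated by central differences; the step size `h` and the value of `theta0` are arbitrary choices.

```python
import math

def bernoulli_discrimination(theta, theta0):
    """Discrimination information between (theta, 1-theta) and (theta0, 1-theta0)."""
    return (theta * math.log(theta / theta0)
            + (1 - theta) * math.log((1 - theta) / (1 - theta0)))

theta0, h = 0.3, 1e-4
center = bernoulli_discrimination(theta0, theta0)
plus = bernoulli_discrimination(theta0 + h, theta0)
minus = bernoulli_discrimination(theta0 - h, theta0)

# Central-difference approximations at theta0.
first_derivative = (plus - minus) / (2 * h)
second_derivative = (plus - 2 * center + minus) / h ** 2

# Fisher information of a single Bernoulli observation.
fisher = 1 / (theta0 * (1 - theta0))
```

The numerical second derivative agrees with `fisher` to several decimal places, while the first derivative is zero up to discretisation error, consistent with the tangent plane being horizontal at $\theta_0$.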




CHAPTER II 


A BRIEF HISTORY 


1. Information and Communication. 


A concept central to communication theory is a communication channel, consisting of a set of inputs $X$, a set of outputs $Y$, and a line between them $\nu$. For the moment we will leave these three components further unspecified.


1.1 Our first problem is to define what we mean by the amount of information which can be transmitted, that is, the information which can be conveyed by $X$. If $X$ contains $n$ symbols, then we can construct $n^k$ sequences (or "super-symbols") from $k$ copies of $X$, and as we would intuitively expect that $k$ copies of $X$ contain $k$ times as much information, it seems reasonable to use $c \log n$ as a measure of the information potential of $X$.
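The counting argument can be made concrete as follows. The sketch enumerates the super-symbols for hypothetical values of $n$ and $k$ and takes the constant $c$ to be 1 with natural logarithms.

```python
import itertools
import math

n, k = 3, 4                       # hypothetical alphabet size and number of copies
symbols = range(n)

# All sequences ("super-symbols") built from k copies of the alphabet.
sequences = list(itertools.product(symbols, repeat=k))

info_one_copy = math.log(n)                  # c log n with c = 1
info_k_copies = math.log(len(sequences))     # log of the number of super-symbols
```

There are $n^k$ sequences, so the logarithmic measure assigns to $k$ copies exactly $k$ times the information potential of a single copy.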


However, let us suppose we have two channels, and in both cases $X$ contains 2 symbols, 0 and 1. A message is sent along each channel once a second. However, in the first channel, 0 and 1 are sent with about equal regularity, but in the second one, most messages are 1s and a 0 occurs on the average only once a year. We would intuitively think that the second channel would be transmitting far less information.

It hence appears that the probabilities of the symbols being transmitted affect the amount of information, so that $X$ would be completely




described as a probability space (X,X,p) . 


Claude Shannon in 1948 [39] gave axioms which an information
measure should satisfy, and derived from them the formula for entropy (I,
3.1.1). Shannon's axioms were stronger than the ones we gave in (I, 3.1).
He assumed that H, as a function of the individual probabilities for
fixed n (the number of symbols), was continuous; that it was increasing in
n if the probabilities were uniform; and, essentially, (I, 3.1.2). It is
customary to use logarithms to base 2.
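The two binary channels described earlier can be compared numerically. The following is a minimal Python sketch, assuming that the entropy formula referred to as (I, 3.1.1) is the usual $H = -\sum p \log_2 p$; the function name `entropy_bits` and the 365-day year are illustrative assumptions.

```python
import math

def entropy_bits(probs):
    """Shannon entropy -sum p log2 p in bits; zero-probability symbols contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# First channel: 0 and 1 are sent about equally often.
h_even = entropy_bits([0.5, 0.5])

# Second channel: one symbol per second, a 0 roughly once a year (assumed 365 days).
p_zero = 1.0 / (365 * 24 * 3600)
h_rare = entropy_bits([p_zero, 1.0 - p_zero])

print("even channel:", h_even, "bits per symbol")
print("rare-zero channel:", h_rare, "bits per symbol")
```

Under these assumptions the second channel carries only a minute fraction of a bit per symbol, which agrees with the intuition that it transmits far less information.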


Entropy is maximum when the probabilities are uniform, and any 
"averaging process" (essentially a convolution) tends to increase it. These 
properties are what one would expect if entropy is thought of as a measure 
of disorder. The concept had been used in that context earlier in statis- 


tical mechanics [40]. 


If we consider the product set $X^n$ with probability $p^n$, as the
set of n-symbol sequences sent independently, then we have:

For every $\varepsilon > 0$ and $\delta > 0$ there is an N such that for all $n > N$
there is a set $S \subset X^n$ with $p^n(S) > 1 - \varepsilon$ and

$$ \left| \log_2 p^n\{s\} + nH \right| < n\delta \quad \text{for all } s \in S. \tag{1.1.1} $$

That is, $2^{-nH}$ is the approximate probability, for large n, of each
"reasonably probable" sequence of n elements.
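The statement that almost all of the probability sits on sequences of probability roughly $2^{-nH}$ can be checked exactly for a binary source. The sketch below assumes independent symbols with $P\{1\} = 0.2$ and a tolerance $\delta = 0.05$ per symbol; both values, and the function names, are illustrative choices.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary source emitting 1 with probability p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def typical_set_probability(n, p, delta):
    """Total probability of the length-n sequences s with |log2 P{s} + nH| < n*delta."""
    h = binary_entropy(p)
    total = 0.0
    for k in range(n + 1):  # k = number of ones in the sequence
        log2_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(log2_prob + n * h) < n * delta:
            # All C(n, k) sequences with k ones share this probability;
            # work in logarithms to avoid floating-point underflow.
            log_count = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            total += math.exp(log_count + log2_prob * math.log(2))
    return total

for n in (100, 1000, 5000):
    print(n, typical_set_probability(n, 0.2, 0.05))
```

The printed probabilities increase towards 1 as n grows, as the limit statement requires.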




1.2 If there is a one-to-one correspondence between the input symbols and
output symbols, then we speak of a noiseless channel. In this case, no
information is lost in transmission. The more usual case is where the link
between the channels is stochastic. For each x ∈ X, there is a probability
distribution ν(x,·) on Y. The situation is now identical with (I, 3.3),
and we can define R(X,Y) as the rate of transmission. Note that the noise
characteristic ν is assumed known.


The supremum of R(X,Y) over all probabilities p on X is 
called the capacity of the channel. Coding theory is concerned with 


choosing symbols so as to maximize capacity. 
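As an illustration of capacity, the following sketch takes a binary symmetric channel with error probability 0.1 and searches a grid of input distributions. It assumes that the rate R(X,Y) is the usual mutual information between input and output; the grid search and all names are illustrative, not a method taken from the text.

```python
import math

def rate_bits(p_input, channel):
    """R(X,Y) = sum_x sum_y p(x) v(x,y) log2( v(x,y) / u(y) ), u being the output distribution."""
    n_out = len(channel[0])
    u = [sum(p_input[x] * channel[x][y] for x in range(len(p_input)))
         for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_input):
        for y in range(n_out):
            joint = px * channel[x][y]
            if joint > 0:
                total += joint * math.log2(channel[x][y] / u[y])
    return total

def capacity_by_grid(channel, steps=1000):
    """Crude supremum of the rate over input distributions (a, 1 - a)."""
    return max(rate_bits([i / steps, 1 - i / steps], channel)
               for i in range(steps + 1))

eps = 0.1                                  # assumed error probability
noisy = [[1 - eps, eps], [eps, 1 - eps]]   # v(x, y) for the symmetric channel
noiseless = [[1.0, 0.0], [0.0, 1.0]]

print(capacity_by_grid(noisy))      # close to 1 - H(eps)
print(capacity_by_grid(noiseless))  # one bit per symbol
```

For the symmetric channel the supremum is attained at the uniform input distribution, and the noiseless channel recovers the full one bit per symbol.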


1.3 Many real-life communication channels transmit signals which are
continuously varying voltages, and hence are not expressible as one of a
finite number of symbols. It would be useful to have an information measure
for such channels as well. Unfortunately, the obvious quantity (3.5.6)
is always infinite in this case, and hence is not very useful.
Shannon and Wiener [40] independently proposed the measure

$$ -E(\log f) \tag{1.3.1} $$

(though with opposite signs), f being the usual density of the probability
distribution.


Discrimination information has been generalized by Ghurye [18] to the case
where p is any finite measure, absolutely continuous with respect to q,
which is σ-finite, and (I, 4.2.5) holds. Hence, the entropy of a continuous
distribution p can be written as I[p,q], q being Lebesgue measure,
in a sense the "most random" distribution.


Continuous entropy may be infinite, so it has no "maximum". However,
for all distributions concentrated on a bounded set, entropy is maximum
if the distribution is uniform; for all distributions concentrated on
the positive real axis with expectation λ it is maximum if the distribution
is exponential (λ); and for all distributions with variance $\sigma^2$ it
is maximum if the distribution is normal ($\sigma^2$). Entropy is location
invariant but it depends on the scale. It may be negative. Like discrete entropy,
continuous entropy is increased by an "averaging" process.
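These extremal properties can be checked against the standard closed-form entropies of a few familiar densities. The sketch below uses natural logarithms; the Laplace distribution and the particular parameter values are illustrative comparisons that do not appear in the text.

```python
import math

def h_uniform(width):
    """Entropy of the uniform density on an interval of the given width."""
    return math.log(width)

def h_exponential(mean):
    """Entropy of the exponential density with the given expectation."""
    return 1.0 + math.log(mean)

def h_normal(variance):
    """Entropy of the normal density with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def h_laplace(variance):
    """Entropy of the Laplace density; scale b gives variance 2 b^2."""
    b = math.sqrt(variance / 2.0)
    return 1.0 + math.log(2.0 * b)

# Equal variance 1: the normal density has the largest entropy.
print(h_normal(1.0), h_laplace(1.0), h_uniform(math.sqrt(12.0)))
# Positive axis, expectation 1: exponential beats uniform on [0, 2].
print(h_exponential(1.0), h_uniform(2.0))
# Entropy can be negative, and rescaling by 2 adds log 2.
print(h_uniform(0.5), h_normal(4.0) - h_normal(1.0))
```

A uniform density has variance width²/12, which is why the width $\sqrt{12}$ is used for the unit-variance comparison.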


1.4 The rate of transmission can be defined in the continuous case even
when the entropy is infinite, as a discrimination information, cf. (I,
3.4.2). Let f be the density of the input signal, let

$$ p(x,y) = f(x)\,\nu_x(y) $$

(ν being the noise characteristic, see section 1.2),

$$ u(y) = \int f(x)\,\nu_x(y)\,dx $$

and

$$ q(x,y) = f(x)\,u(y); $$

then

$$ R(X,Y) = I[p,q]. $$




2. Information and Inference. 


2.1 The concept of information in the context of statistics was introduced
by R.A. Fisher [13,14]. He defined a "sufficient statistic" as one
which contains all the information in an experiment. This intuitive idea 


is tersely expressed by (I, 4.2e). 


The concept was also used in the theory of estimation, as corres- 


ponding to reciprocal variance (of an estimator). The more dispersed is our 


estimator, the less efficient or informative it is. 


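If $\hat\theta$ is a maximum likelihood estimator, it is the solution of

$$ \frac{\partial L(\theta)}{\partial\theta} = 0 $$

where

$$ L = \sum_{i=1}^{n} \log f_\theta(x_i), \tag{2.1.1} $$

$f_\theta$ being the density of each sample value $x_i$. If $\hat\theta$ is unbiased, and
normally distributed, then we have

$$ -\left.\frac{\partial^2 L}{\partial\theta^2}\right|_{\theta=\hat\theta} = \frac{1}{\operatorname{var}(\hat\theta)} \tag{2.1.2} $$

so that we can use $-\partial^2 L/\partial\theta^2$ as a measure of the informativeness of $\hat\theta$.
We can also talk about expected information:

$$ I(\theta) = E_\theta\!\left[-\frac{\partial^2 L}{\partial\theta^2}\right] = n\,E_\theta\!\left[-\frac{\partial^2 \log f_\theta}{\partial\theta^2}\right] = n\,i(\theta). \tag{2.1.3} $$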




Fisher calls the quantity $i(\theta)$ the intrinsic accuracy of the density
$f_\theta$, and says it represents the maximum amount of information provided
by a single observation [14, p. 709].


Under appropriate regularity conditions, we have the well-known
Rao-Cramér inequality [35; 5, p. 477ff.] which gives for an unbiased estimator:

$$ \operatorname{var}(\hat\theta) \ge \frac{1}{n\,i(\theta)}. \tag{2.1.4} $$
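The bound can be checked on a small example. The sketch below assumes independent observations taking the value 1 with probability θ and 0 otherwise, computes the intrinsic accuracy as the expected negative second derivative of the log density by central differences, and compares the bound with the variance of the sample mean. The model, the step size `h` and all names are illustrative assumptions.

```python
import math

def log_density(x, theta):
    """log f_theta(x) for one observation x in {0, 1} with P{x = 1} = theta."""
    return math.log(theta) if x == 1 else math.log(1.0 - theta)

def intrinsic_accuracy(theta, h=1e-4):
    """i(theta) = E_theta[-d^2 log f_theta / d theta^2], by central differences."""
    total = 0.0
    for x, prob in ((1, theta), (0, 1.0 - theta)):
        second = (log_density(x, theta + h)
                  - 2.0 * log_density(x, theta)
                  + log_density(x, theta - h)) / h ** 2
        total += prob * (-second)
    return total

theta, n = 0.3, 50
i_theta = intrinsic_accuracy(theta)
bound = 1.0 / (n * i_theta)              # lower bound on the variance
var_mean = theta * (1.0 - theta) / n     # variance of the sample mean

print(i_theta, 1.0 / (theta * (1.0 - theta)))
print(bound, var_mean)
```

In this model the sample mean is unbiased and its variance coincides with the bound, so the inequality holds with equality up to the finite-difference error.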


Equality holds if and only if the estimator is sufficient and
normally distributed, so that Fisher's idea of maximum information is well
founded. (2.1.4) can be extended to the multi-parameter case [35]. Letting
I be the matrix

$$ I = \left( E_\theta\!\left[ -\frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j} \right] \right) \tag{2.1.5} $$

then we have that

$$ D \ge I^{-1}, \tag{2.1.6} $$

D representing the dispersion matrix. We have already seen the relation
between the matrix I and discrimination information (I, 4.3).


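2.2 If $\hat\theta$ is a sufficient estimator, then we can write $f_\theta(x) = g_\theta(t)\,h(x|t)$,
where g is the density of $\hat\theta$. Then,

$$ \frac{\partial^2 \log f_\theta(x)}{\partial\theta^2} = \frac{\partial^2 \log g_\theta(t)}{\partial\theta^2}. \tag{2.2.1} $$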




Bearing this in mind, we can define, for any random vector T such that
its one-parameter density is twice-differentiable, the information quantity:

$$ I_\theta(T) = E_\theta\!\left( -\frac{\partial^2 \log f_\theta}{\partial\theta^2} \right), \tag{2.2.2} $$

$f_\theta$ now denoting the density of T. This quantity has certain attractive
properties [34, p. 5]:

(a) $I_\theta(T) \ge 0$, equality holding iff T has the same distribution for
all θ.

(b) $I_\theta(T) \le I_\theta(X)$ if T is a statistic from X, equality holding
only in the case of sufficiency.

(c) $I_\theta(T_1, T_2) = I_\theta(T_1) + I_\theta(T_2)$ if $T_1$ and $T_2$ are independent.


2.3 Fisher drew a parallel between his information and entropy [15, p.
47]. Thermodynamic processes which are irreversible are accompanied by an
increase in entropy. Statistical processes which are irreversible (i.e.,
data processing) cannot increase information.


2.4 A complete theory of inference based on discrimination information

has been developed by Kullback [29]. In it, the discrimination information 
is used as a pseudo-distance. Central to his development is an exponential 
family of distributions "closest" to a "null" distribution of all distribu- 


tions yielding the same expected value for a statistic. 
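The role of the exponential family can be seen on a finite example. The sketch below assumes a "null" distribution q given by a fair die and a statistic T equal to the face value; among distributions with E T = 4.5 it finds the member of the family p(x) proportional to q(x) exp(τ T(x)) by bisection on τ, and compares its discrimination information with that of another distribution having the same expectation. The die, the target value and all names are illustrative assumptions.

```python
import math

def tilt(q, t_values, tau):
    """Exponential family member: p(x) proportional to q(x) * exp(tau * T(x))."""
    weights = [qx * math.exp(tau * t) for qx, t in zip(q, t_values)]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p, t_values):
    return sum(px * t for px, t in zip(p, t_values))

def discrimination(p, q):
    """I[p,q] = sum p log(p/q), natural logarithms, with 0 log 0 = 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def closest_with_mean(q, t_values, target, lo=-50.0, hi=50.0):
    """Bisection on tau; the expectation of T under tilt(q, T, tau) increases with tau."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilt(q, t_values, mid), t_values) < target:
            lo = mid
        else:
            hi = mid
    return tilt(q, t_values, 0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]
q = [1.0 / 6.0] * 6
p_star = closest_with_mean(q, faces, 4.5)
rival = [0.0, 0.0, 0.25, 0.25, 0.25, 0.25]   # another distribution with E T = 4.5

print(p_star)
print(discrimination(p_star, q), discrimination(rival, q))
```

The tilted distribution has the smaller discrimination information of the two, in line with its being the "closest" to q among distributions with the given expected value.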




2.5 The calculation of a statistic can be considered as a communications
channel as in section 1 which is ambiguous but otherwise noiseless, i.e.,
the measures ν(x,·) are degenerate. Csiszár [7] introduced the idea of
a noisy channel in statistics, which may be realized in the case where an
observation has an error in addition to the intrinsic error of the experimental
material. He calls it the case of indirect observation.


Not surprisingly, an indirect observation can only decrease the
discrimination information. If the decrease is zero, the channel can be
called sufficient. If the decrease is less than ε, the channel can be
called ε-sufficient. (However, if the information in both the direct and
indirect observations is infinite, the "decrease" is undefined.) The same
applies a fortiori to statistics and subfields. If a sufficient channel
does not exist, or is expensive to construct, an ε-sufficient one may be
adequate.


3. Various Tangents. 


3.1 Rényi [36] has considered generalizations of entropy and discrimination
information by relaxation of the axioms.


By replacing the conditional additivity (I, 3.1.2) by independent
additivity (I, 4.1.4) he deduces that the quantities

$$ H_\beta(\mathcal{A}) = \frac{1}{1-\beta} \log \sum_{A} [p(A)]^\beta, \tag{3.1.1} $$

the summation extending over all atoms A of $\mathcal{A}$,
also satisfy the axioms. It is easily shown that

$$ \lim_{\beta \to 1} H_\beta(\mathcal{A}) = H(\mathcal{A}). \tag{3.1.2} $$


Rényi's treatment includes defective probabilities as well. In this case
the sum in (3.1.1) is divided by the total mass $\sum_A p(A)$.


Similarly, Rényi generalized discrimination information to

$$ I_\beta(\mathcal{A}) = \frac{1}{\beta-1} \log \sum_{A} q(A) \left[ \frac{p(A)}{q(A)} \right]^\beta, \tag{3.1.3} $$

the summation extending over all atoms A of $\mathcal{A}$. This quantity again
satisfies the independent additivity property (I, 4.2.3).
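Both generalized quantities are easy to evaluate on a finite algebra. The sketch below assumes three atoms with illustrative probabilities p and q, uses natural logarithms, and checks numerically that the order-β quantities approach entropy and discrimination information as β tends to 1, and that the generalized entropy is additive over independent algebras. All names and numbers are illustrative assumptions.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi_entropy(p, beta):
    """H_beta = log(sum p^beta) / (1 - beta), for beta != 1."""
    return math.log(sum(x ** beta for x in p if x > 0)) / (1.0 - beta)

def discrimination(p, q):
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def renyi_information(p, q, beta):
    """I_beta = log(sum q (p/q)^beta) / (beta - 1), for beta != 1 and q > 0."""
    return math.log(sum(y * (x / y) ** beta for x, y in zip(p, q))) / (beta - 1.0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

for beta in (2.0, 1.1, 1.001):
    print(beta, renyi_entropy(p, beta), renyi_information(p, q, beta))
print("limit:", entropy(p), discrimination(p, q))

# Independent additivity: atoms of the product algebra have product probabilities.
product = [a * b for a in p for b in q]
print(renyi_entropy(product, 2.0), renyi_entropy(p, 2.0) + renyi_entropy(q, 2.0))
```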


3.2 Discrimination information was seen, in the general case, to be the
supremum of discrimination information over all finite sub-algebras.

An analogous quantity was defined by Ghurye [18] for any finite
measure p dominated by a σ-finite measure q defined on $\mathcal{A}$, and any
convex function f defined on [0,∞], as

$$ I_f(\mathcal{A}) = \sup \sum_{B} q(B)\, f\!\left( \frac{p(B)}{q(B)} \right) \tag{3.2.1} $$

where the summation extends over all atoms B of the atomic subalgebras
$\mathcal{B}$ of $\mathcal{A}$. If $dp/dq$ is the density of p with respect to q, then we
have the analogue of (I, 4.2.5):

$$ I_f(\mathcal{A}) = \int f\!\left( \frac{dp}{dq} \right) dq. \tag{3.2.2} $$




We have seen that continuous entropy is an example of this 


generalization. We will need it again in chapter III. 


3.3 Two of the most significant properties of discrimination information
are that it is monotonically increasing ($\mathcal{B} \subset \mathcal{A} \Rightarrow I(\mathcal{B}) \le I(\mathcal{A})$), but is
strictly increasing only for non-sufficient sub-algebras (i.e., $I(\mathcal{B}) = I(\mathcal{A})$
if $\mathcal{B}$ is a sufficient sub-algebra). It turns out that convexity of t log t
is the only assumption needed to prove these two facts. Csiszár [7] has
hence defined for any convex function f the f-divergence between the
probability measures p and q, defined by (3.2.2). To take care of the
situation where p is not dominated by q, he defined

$$ 0\, f\!\left(\tfrac{0}{0}\right) = 0 \quad\text{and}\quad 0\, f\!\left(\tfrac{a}{0}\right) = \lim_{\varepsilon \downarrow 0} \varepsilon\, f\!\left(\tfrac{a}{\varepsilon}\right) = a \lim_{t \to \infty} \frac{f(t)}{t}. $$


There may be situations where some f-divergences are more meaningful
than discrimination information. It has already been mentioned that
the neighbourhood systems defined by I on the space of measures need not
define a topological space. However, if both f(0) and $\lim_{t \to \infty} f(t)/t$ are
finite and f is strictly convex at 1, then $I_f$ does define a topology,
equivalent to the variation difference, the latter itself being an f-divergence
with $f(t) = |t-1|$ [8].




If p is not dominated by q , then discrimination information 
is infinite. However, if an event is impossible under q , yet has a posi- 
tive but very small probability under p , we may not want to think of this 
"discrimination distance" as so large, as in fact the two distributions may 
be very difficult to tell apart by experiment. By appropriate choice of
f, we can allow $I_f$ to remain finite in such a case.
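The effect of the choice of f can be seen on two atoms. The sketch below evaluates the sum of q f(p/q) over the atoms of a finite algebra, with the convention that an atom having q = 0 and p = a > 0 contributes a times the limit of f(t)/t. The parameter name `slope_at_infinity` and the example measures are illustrative assumptions.

```python
import math

def f_divergence(p, q, f, slope_at_infinity):
    """Sum of q f(p/q) over atoms; atoms with q = 0 < p contribute p * lim f(t)/t."""
    total = 0.0
    for a, b in zip(p, q):
        if b > 0:
            total += b * f(a / b)
        elif a > 0:
            total += a * slope_at_infinity
    return total

def t_log_t(t):
    """f(t) = t log t, with f(0) = 0; f(t)/t is unbounded as t grows."""
    return t * math.log(t) if t > 0 else 0.0

def variation(t):
    """f(t) = |t - 1|; f(t)/t tends to 1."""
    return abs(t - 1.0)

# An event impossible under q has small positive probability under p.
p = [0.999, 0.001]
q = [1.0, 0.0]
print(f_divergence(p, q, t_log_t, math.inf))   # infinite
print(f_divergence(p, q, variation, 1.0))      # small and finite

# Merging atoms cannot increase the divergence.
p3, q3 = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
p2, q2 = [0.5, 0.5], [0.8, 0.2]
print(f_divergence(p3, q3, t_log_t, math.inf), f_divergence(p2, q2, t_log_t, math.inf))
```

With f(t) = |t - 1| the two measures above are at distance 0.002, which reflects how hard they would be to tell apart by experiment, whereas the discrimination information is infinite.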




CHAPTER III 


INFORMATION IN MARKOV PROCESSES 


1. Basics of Markov Theory. 


1.1 Let us now suppose that we have a family of sub-algebras of a probability
algebra $\mathcal{A}$, which has a temporal structure. Let T be a totally
ordered set (which is usually assumed to be either the non-negative integers,
or the non-negative real line). For every t ∈ T, we will suppose we have
a sub-algebra $\mathcal{B}_t$, called the algebra of events observed at epoch t.
Also, for any closed interval [s,t], we have an algebra of events $\mathcal{C}_{s,t}$
called the algebra of events observed between epochs s and t. As we
will be dealing solely with separable algebras, we will assume that each
$\mathcal{B}_t$ and each $\mathcal{C}_{s,t}$ is separable. We will, in fact, assume a stronger
separability condition:

If D is a dense countable subset of (s,t) then

$$ \mathcal{C}_{s,t} = \bigvee_{r \in D} \mathcal{B}_r. \tag{1.1.1} $$

This latter condition, of course, is trivial in the case that T is
countable.


If we have a continuous probability p defined on $\mathcal{A}$, we will
call the system $(\mathcal{A}, T, \mathcal{B}_t, \mathcal{C}_{s,t}, p)$ a separable stochastic process. We will
say that it has the Markov property if for any t and s > t, we have:

The algebras $\mathcal{C}_{0,t}$ and $\mathcal{C}_{t,s}$ are conditionally independent
given $\mathcal{B}_t$.


We will say that it is temporally homogeneous if, for any s and
t and h (s < t),
p(E|B.) = PCF |B) Bee ke Feb = Cin.) 


We will concern ourselves exclusively with temporally homogeneous 


processes having the Markov property. 


The quantities in 1.1.2 are called transition probabilities, and
our separability condition 1.1.1 ensures that any event in any $\mathcal{C}_{s,t}$ has
a unique probability defined in terms of them, conditional on $\mathcal{B}_s$.


1.2 The preceding was a brief introduction to stochastic processes in
the language of Chapter I. In order to make use of other results, without
the need for lengthy reformulation, we will return to more standard notation.
We will assume that the algebras $\mathcal{B}_t$ are all isomorphic, and hence
can be represented by a family of random variables with values in the same
space.


Let $(\Omega, \mathcal{A}, p)$ be our basic probability space, and let $(E, \mathcal{E})$
be another measurable space which we call the state space. For each t ∈ T
let $X_t$ be a measurable function from Ω to E. Our algebras $\mathcal{B}_t$ then
are simply $X_t^{-1}(\mathcal{E})$, and the values which $X_t$ takes on are our
observables.




Two processes $X_t$ and $Y_t$ on the same state space are said to
be equivalent if the finite-dimensional distributions are the same, i.e.
for all $t_1, \dots, t_n \in T$ and $F_1, \dots, F_n \in \mathcal{E}$,

$$ p\{X_{t_j} \in F_j,\ j = 1, \dots, n\} = p\{Y_{t_j} \in F_j,\ j = 1, \dots, n\}. \tag{1.2.1} $$
We will be concerned exclusively with finite-dimensional distri- 
butions and their limits, so that we will essentially identify equivalent 
processes. As every stochastic process is equivalent to a separable pro- 
cess [32, IV, T29], our separability condition is hence no restriction. 
(Although our definition of separability appears more restrictive than 
Meyer's, I conjecture they are equivalent if one assumes continuity in 


probability. See [9, II, Theorem 2.2].)


1.3 Transition probabilities are usually expressed in terms of Markovian
kernels [12, 10, 33]:

$$ P_t(x, F) = p(X_{s+t} \in F \mid X_s = x). \tag{1.3.1} $$


Given an initial distribution $P_0$, we can define the distribution
at any time t by

$$ p(X_t \in F) = \int P_0(dx)\, P_t(x, F); \tag{1.3.2} $$

hence we can see that the Markov kernels act as operators on the linear
space of measures on E, which leave the subset of probability measures
invariant.


They also define operators on the dual space of functions on E,
defined by

$$ (P_t^* f)(x) = \int P_t(x, dy)\, f(y). $$


1.4 The operators $P_t^*$ form a semi-group of transition operators.
That is, they satisfy the Chapman-Kolmogorov relation

$$ P_s^* P_t^* = P_{s+t}^*; \tag{1.4.1} $$

they take positive functions into positive functions, and they leave
constant functions invariant. From (1.4.1) we can see that they commute
with each other. They are of norm 1.
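These properties can be verified on the simplest non-trivial case. The sketch below assumes a two-state process that leaves state 0 at rate a and state 1 at rate b, for which the transition kernel has a well-known closed form; the rates, the test function and all names are illustrative assumptions.

```python
import math

def two_state_kernel(t, a=2.0, b=1.0):
    """P_t(x, {y}) for a two-state process leaving 0 at rate a and 1 at rate b."""
    s = a + b
    decay = math.exp(-s * t)
    return [[(b + a * decay) / s, (a - a * decay) / s],
            [(b - b * decay) / s, (a + b * decay) / s]]

def compose(p, q):
    """Kernel composition: (PQ)(x, {z}) = sum_y P(x, {y}) Q(y, {z})."""
    n = len(p)
    return [[sum(p[x][y] * q[y][z] for y in range(n)) for z in range(n)]
            for x in range(n)]

def apply(p, f):
    """Action on functions: (P f)(x) = sum_y P(x, {y}) f(y)."""
    return [sum(row[y] * f[y] for y in range(len(f))) for row in p]

p_s, p_t = two_state_kernel(0.4), two_state_kernel(1.1)

print(compose(p_s, p_t))
print(two_state_kernel(1.5))        # same as the composition above
print(apply(p_t, [1.0, 1.0]))       # constant functions are left invariant
print(apply(p_t, [3.0, 0.5]))       # positive functions stay positive
```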


The operators can be embedded in a Banach algebra, in which
analogues of classical algebraic and analytic procedures can be developed
[21], including limits, differentiation and integration. There are two
limits that concern us. We say that the operator sequence $\{T_n\}$ converges
to T uniformly or strongly as, respectively,

$$ \lim_{n \to \infty} \|T_n - T\| = 0 \tag{1.4.2} $$

$$ \lim_{n \to \infty} \|T_n f - T f\| = 0 \quad \forall f \in D, \tag{1.4.3} $$

D being the domain of the operators. Uniform convergence implies strong
convergence.


Semi-groups of transition operators such that $P_0^* = I$ and
that $P_t^* \to I$ strongly as $t \downarrow 0$ are called the Feller semi-groups [33],
and are in fact (strongly) continuous everywhere. We will deal exclusively
with such semi-groups, and in fact we will suppose that the transition 
probabilities are conservative, which is to say they are non-defective prob- 


ability measures. 


If the limit

$$ A^* = \lim_{t \downarrow 0} \frac{P_t^* - I}{t} \tag{1.4.4} $$

exists uniformly, then we say that the semi-group is (uniformly) differentiable,
and from this fact follows the Kolmogorov backward equation:

$$ \frac{dP_t^*}{dt} = A^* P_t^*. \tag{1.4.5} $$

In fact, the semi-group $(P_t^*)$ can be represented as:

$$ P_t^* = e^{tA^*}. \tag{1.4.6} $$

$A^*$ is called the infinitesimal generator of the semi-group.
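For a finite state space the generator is a matrix and the exponential can be summed directly. The sketch below assumes a two-state generator with rows summing to zero, sums the exponential power series, and compares the result with the known closed-form transition matrix and with the difference quotient defining the generator. The generator entries, the number of series terms and all names are illustrative assumptions.

```python
import math

A = [[-2.0, 2.0],
     [1.0, -1.0]]   # assumed two-state generator; each row sums to zero

def mat_mul(p, q):
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def exp_series(a, t, terms=60):
    """Partial sum of exp(tA) = sum over k of (tA)^k / k!."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        scaled = [[t * a[i][j] / k for j in range(n)] for i in range(n)]
        term = mat_mul(term, scaled)          # now equals (tA)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def closed_form(t):
    """Known transition matrix for the generator A above."""
    decay = math.exp(-3.0 * t)
    return [[(1 + 2 * decay) / 3, (2 - 2 * decay) / 3],
            [(1 - decay) / 3, (2 + decay) / 3]]

print(exp_series(A, 0.7))
print(closed_form(0.7))

h = 1e-6   # small t for the difference quotient (P_t - I) / t
p_h = closed_form(h)
print([[(p_h[i][j] - (1.0 if i == j else 0.0)) / h for j in range(2)] for i in range(2)])
```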


If the limit in (1.4.4) does not exist uniformly, but if it
exists strongly in some subspace C dense in D (i.e.,

$$ A^* f = \lim_{t \downarrow 0} \frac{P_t^* f - f}{t} \quad \forall f \in C, \tag{1.4.7} $$

the limit being in the norm), we still say that $A^*$ is the infinitesimal
generator. (1.4.5) still holds [although $dP_t^*/dt$ is now only a strong
derivative] but (1.4.6) need not, since $A^*$ may be an unbounded operator,
so that the exponential function is not defined. However, the semi-group
can be approximated by semi-groups of the form (1.4.6) [12]. We also have
the inverse of (1.4.5), viz.

$$ P_t^* - I = A^* \int_0^t P_s^*\, ds. \tag{1.4.8} $$

This relation is true on D if (1.4.4) exists uniformly, and is
still true on C if only (1.4.7) holds [10; 21].


1.5 An important class of Markov processes is that of the "step" kind.
That is, the process remains constant for a random length of time, and then 
jumps to another state, where it again remains for a random length of time. 


Only a finite number of jumps occur in any finite interval. 


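There are three limit expressions which figure in the study of
processes of this kind.

$$ \lim_{t \downarrow 0} P_t(x, \{x\}) = 1 \tag{1.5.1} $$

$$ \lim_{t \downarrow 0} \frac{P_t(x, \{x\}) - 1}{t} = A_0(x, \{x\}) \tag{1.5.2} $$

$$ \lim_{t \downarrow 0} \frac{P_t(x, F)}{t} = A_1(x, F) \quad \text{if } x \notin F \tag{1.5.3} $$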




Convergence in any of the three can be pointwise or uniform in x, and in
(1.5.3) the convergence can be pointwise or uniform in F.

(1.5.1) expresses the fact that if $X_0 = x$ then for t sufficiently small
$X_t = x$ (almost surely). It is a necessary condition for the sample
functions to be step functions.

Uniform convergence of (1.5.1) is equivalent to the boundedness of $A_0$,
and implies uniform convergence of (1.5.2) and uniform convergence of
(1.5.3) in F. Furthermore, $A_1$ is in fact a measure, so that
$A = A_0 + A_1$ can be considered a signed measure, for each x, with
$A(x,E) = 0$; and $A_1(x,F)$ is a measurable function of x for each F
[9, VI.2].

If in fact the convergence in (1.5.3) is also uniform in x, then it follows
that (1.4.4) exists uniformly, and

    $A^* f(x) = \int A(x,dy) \, f(y)$    (1.5.4)

(see appendix 2).

Even if (1.4.4) does not hold uniformly, the operator $A^*$ will be bounded
provided that the function $A_0$ is.
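
On a finite state space the kernel A of (1.5.4) is simply a matrix, which
makes the relations above easy to check numerically. The following is a
minimal sketch under that assumption: the three-state rate matrix and the
helper names (split_kernel, apply_generator, transition_matrix) are
illustrative and do not come from the text, and the transition matrix is
approximated by a truncated exponential series, which is adequate only
because the generator here is bounded.

```python
# Hypothetical 3-state step process; the rates are illustrative only.
# Row x holds the kernel A(x, .): off-diagonal entries play the role of the
# jump intensities A_1(x, {y}), the diagonal entry that of A_0(x, {x}).
A = [[-2.0, 1.5, 0.5],
     [0.3, -0.3, 0.0],
     [1.0, 1.0, -2.0]]


def split_kernel(A):
    """Split A into its diagonal part A_0 and its off-diagonal part A_1."""
    n = len(A)
    A0 = [[A[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]
    A1 = [[A[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    return A0, A1


def apply_generator(A, f):
    """Finite-state form of (1.5.4): (A* f)(x) = sum over y of A(x, y) f(y)."""
    return [sum(a * fy for a, fy in zip(row, f)) for row in A]


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(A, t, terms=25):
    """P_t = exp(tA), summed as a truncated power series (bounded A only)."""
    n = len(A)
    tA = [[t * a for a in row] for row in A]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, tA)]
        result = [[r + v for r, v in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result


f = [1.0, -2.0, 0.5]          # an arbitrary bounded function on the states
t = 1e-4
P_t = transition_matrix(A, t)
# Difference quotient (P_t* f - f) / t, to be compared with A* f.
difference_quotient = [(sum(p * fy for p, fy in zip(row, f)) - f[x]) / t
                       for x, row in enumerate(P_t)]
generator_value = apply_generator(A, f)
```

Each row of A sums to zero, which is the finite-state form of A(x,E) = 0,
and for small t the difference quotient agrees with the generator applied
to f up to an error of order t.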


2. Two Simple Examples. 


We will now consider discrimination information in a temporally

homogeneous Markov process.



Let us suppose that we have a separable stochastic process
$(\mathcal{A}, T, \mathcal{B}_t, \mathcal{C}_{s,t}, p)$. If q is another
probability on $\mathcal{A}$, we can consider I[p,q], the discrimination
information in favour of p against q contained in various sub-algebras of
$\mathcal{A}$. In particular we will be interested in the information
contained in the sub-algebras $\mathcal{B}_t$ and $\mathcal{C}_{0,t}$.

Csiszar has considered discrimination information between two Markov chains
with different initial probabilities, and the same transition probabilities
[6]. Not surprisingly, $I(\mathcal{B}_t)$ decreases with t; in fact, if the
chain is recurrent, irreducible and aperiodic, it converges to zero, hence
proving ergodicity.

We will take the opposite approach. We will assume that the initial
subfield $\mathcal{B}_0$ is given, and will be concerned with the
discrimination information between two families of transition
probabilities.


First, we will calculate the discrimination information directly, 


for two simple cases. 


2.1 Suppose we have two Poisson processes starting with $X_0 = 0$, with
probabilities $p_1$ and $p_2$, and with intensity parameters $\lambda_1$
and $\lambda_2$ respectively. We want to find the discrimination
information in favour of $p_1$ against $p_2$ (i.e. in favour of the
hypothesis $\lambda = \lambda_1$ against the hypothesis
$\lambda = \lambda_2$) in the subfield $\mathcal{C}_{0,t}$.


For any integer n, let $C^{(n)}_{k,i}$ represent the measurable set

    $\{ X_{kt/n} - X_{(k-1)t/n} = i \}, \quad k = 1, \dots, n$    (2.1.1)

and let $\mathcal{C}^{(n)}$ be the $\sigma$-algebra generated by these
sets. The atoms of this algebra are

    $C^{(n)}_{i_1 \dots i_n} = \bigcap_{k=1}^{n} C^{(n)}_{k,i_k}$    (2.1.2)

where $i_k \in N_0$.

Now, because the process has independent increments, we have

    $p_j(C^{(n)}_{i_1 \dots i_n}) = \prod_{k=1}^{n} e^{-\lambda_j t/n} \frac{(\lambda_j t/n)^{i_k}}{i_k!}, \quad j = 1, 2$    (2.1.3)

Hence,

    $I(p_1; p_2; \mathcal{C}^{(n)}) = \sum_{i_1, \dots, i_n} p_1(C^{(n)}_{i_1 \dots i_n}) \log \frac{p_1(C^{(n)}_{i_1 \dots i_n})}{p_2(C^{(n)}_{i_1 \dots i_n})}$

    $= (\lambda_2 - \lambda_1) t + \lambda_1 t \log \frac{\lambda_1}{\lambda_2}$    (2.1.4)

Note that n does not figure in (2.1.4), so that using (1.1.1) and
(1,4.2.6) we obtain

    $I(p_1; p_2; \mathcal{C}_{0,t}) = (\lambda_2 - \lambda_1) t + \lambda_1 t \log \frac{\lambda_1}{\lambda_2}$    (2.1.5)
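
The independence of n claimed for (2.1.4) can be checked numerically. The
sketch below assumes that, by independence of the increments, the
information on the grid is n times the information between two Poisson
counts with means $\lambda_1 t/n$ and $\lambda_2 t/n$; the function names
and the numerical values of the intensities are illustrative only.

```python
import math


def log_poisson_pmf(k, mean):
    """Logarithm of the Poisson probability of k events with the given mean."""
    return -mean + k * math.log(mean) - math.lgamma(k + 1)


def kl_poisson(mean1, mean2, kmax=200):
    """Discrimination information between two Poisson counts, summed directly."""
    total = 0.0
    for k in range(kmax + 1):
        lp1 = log_poisson_pmf(k, mean1)
        lp2 = log_poisson_pmf(k, mean2)
        total += math.exp(lp1) * (lp1 - lp2)
    return total


def information_on_grid(lam1, lam2, t, n):
    """Information carried by n independent increments of length t/n."""
    return n * kl_poisson(lam1 * t / n, lam2 * t / n)


def closed_form(lam1, lam2, t):
    """Right side of (2.1.5)."""
    return (lam2 - lam1) * t + lam1 * t * math.log(lam1 / lam2)


lam1, lam2, t = 2.0, 0.5, 3.0        # illustrative values only
values = [information_on_grid(lam1, lam2, t, n) for n in (1, 4, 16, 64)]
```

Every entry of values agrees with closed_form(lam1, lam2, t) to within
rounding error, whatever the number of subdivisions.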


2.2 Next, let us suppose that we have two Brownian motions starting at
the origin with differing diffusion rates a and b. Again we want the
discrimination information in favour of the hypothesis $\sigma_t^2 = at$
against the hypothesis $\sigma_t^2 = bt$, contained in $\mathcal{C}_{0,t}$.




Let $\mathcal{C}^{(n)}$ represent the algebra

    $\bigvee_{k=1}^{n} \mathcal{B}_{kt/n}$

The discrimination information between two central normal distributions
with different variances can readily be calculated as

    $I[\sigma_1^2; \sigma_2^2] = \frac{1}{2} \left( \log \frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1 \right)$    (2.2.1)

hence we see that the information depends only on the ratio of the
variances. Thus, the information in the sub-field $\mathcal{B}_t$ is equal
to

    $\frac{1}{2} \left( \log \frac{b}{a} + \frac{a}{b} - 1 \right)$    (2.2.2)

and in $\mathcal{C}^{(n)}$ it is equal to (using the property of
independent increments)

    $\frac{n}{2} \left( \log \frac{b}{a} + \frac{a}{b} - 1 \right)$    (2.2.3)

Letting $n \to \infty$, we see that the information in $\mathcal{C}_{0,t}$
is infinite.
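
The closed form for two central normal distributions and the linear growth
in n can both be checked with a few lines of code. This is only a sketch:
the function names, the trapezoidal quadrature and the numerical values of
a, b and t are assumptions made for illustration.

```python
import math


def kl_centred_normal(var1, var2):
    """Closed form (2.2.1) for N(0, var1) against N(0, var2)."""
    return 0.5 * (math.log(var2 / var1) + var1 / var2 - 1.0)


def kl_centred_normal_numeric(var1, var2, steps=20001, width=12.0):
    """Trapezoidal approximation of the integral of p1 * log(p1 / p2)."""
    half_range = width * math.sqrt(var1)
    h = 2.0 * half_range / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = -half_range + i * h
        p1 = math.exp(-x * x / (2.0 * var1)) / math.sqrt(2.0 * math.pi * var1)
        log_ratio = (0.5 * math.log(var2 / var1)
                     - x * x / (2.0 * var1) + x * x / (2.0 * var2))
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * p1 * log_ratio * h
    return total


def information_on_grid(a, b, t, n):
    """Information in n independent increments of variance a*t/n against b*t/n."""
    return n * kl_centred_normal(a * t / n, b * t / n)


a, b, t = 1.0, 3.0, 2.0              # illustrative values only
growth = [information_on_grid(a, b, t, n) for n in (1, 10, 100, 1000)]
```

Because only the ratio a/b enters each increment, the entries of growth are
exactly proportional to n, which is the divergence noted in the text.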


3. A General Formulation. 
3.1 Let us first define some notation.


For any t let $\phi(t,x)$ be $I(\mathcal{B}_t \mid x)$, the discrimination
information in $\mathcal{B}_t$ conditioned by $\mathcal{B}_0$. It is the
discrimination information between the transition probabilities with lag t,
and is for given t a random variable on $\mathcal{B}_0$, i.e. a function
on E.
By $\psi(t,x)$ we will denote $I(\mathcal{C}_{0,t} \mid x)$, the
information in $\mathcal{C}_{0,t}$ conditioned by $\mathcal{B}_0$. It is
also a function on E. If we know the initial distribution $\tau$, then

    $I(\mathcal{C}_{0,t}) = \int \psi(t,x) \, d\tau(x)$    (3.1.1)

Obviously $\psi$ is increasing in its first argument, and

    $0 \le \phi \le \psi$    (3.1.2)

An important case is that in which $\psi < \infty$ and equality holds, for
then the observation at epoch t is as informative in discriminating the two
probabilities as all events occurring beforehand. We noticed this case for
the Poisson process.

We will require the following regularity condition on $\phi$: For every t,
$\phi(t,\cdot) \in D$, the domain of the transition operators.

In order to calculate $\phi$ directly, we must have explicit expressions
for the two sets of transition probabilities $P_t$ and $Q_t$. For many
concrete processes of the step kind, the parameters are defined in terms of
the intensities defined in 1.5, and expressions for the transition
probabilities can be very cumbersome. We will derive an operator equation
for $\psi$, which does not involve $\phi$.


We will maintain the following notation: $P_t$ will be the transition
probability of the process under study, its associated operator will be
$P_t^*$, and $A^*$ the infinitesimal generator. If the intensities of 1.5
exist, they will be denoted by $A_0$ and $A_1$, with $A = A_0 + A_1$. For
the process against which we are discriminating, we will use the respective
symbols $Q_t$, $Q_t^*$, $B^*$, $B_0$, $B_1$, $B$.

3.2 Let us first consider the case where our temporal space consists of
the non-negative integers.

Theorem 1:

    $\psi(t,x) = \sum_{k=0}^{t-1} P_k^* \, \psi(1,x)$    (3.2.1)

Proof: In this case $\phi(1,x) = \psi(1,x)$; then

    $\psi(t,x) = I(\bigvee_{k=1}^{t} \mathcal{B}_k \mid x)
               = I(\bigvee_{k=1}^{t-1} \mathcal{B}_k \mid x) + I(\mathcal{B}_t \mid \bigvee_{k=0}^{t-1} \mathcal{B}_k)(x)$    (3.2.2)

where the last term is the information in $\mathcal{B}_t$ conditioned by
$\bigvee_{k=0}^{t-1} \mathcal{B}_k$, averaged over the paths starting
from x. By the Markov property, we have

    $I(\mathcal{B}_t \mid \bigvee_{k=0}^{t-1} \mathcal{B}_k)(x) = P_{t-1}^* \, \psi(1,x)$    (3.2.3)

Thus

    $\psi(t,x) = \psi(t-1,x) + P_{t-1}^* \, \psi(1,x)$    (3.2.4)

and (3.2.1) follows by induction.

Corollary 1: If $\psi(1,\cdot)$ is constant, then we have simply (since the
$P_k^*$ are transition operators)

    $\psi(t,x) = t \, \psi(1,x)$    (3.2.5)
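
Theorem 1 can be verified by brute force on a small chain. The sketch
below compares the sum in (3.2.1) with a direct summation over all paths of
length t. The two transition matrices and the function names are
hypothetical, chosen only so that every entry is positive and all
logarithms are finite.

```python
import itertools
import math

# Two hypothetical transition matrices on the state space {0, 1, 2}.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
Q = [[0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4],
     [0.25, 0.25, 0.5]]
STATES = range(len(P))


def psi_one_step(x):
    """psi(1, x): information between the one-step transition laws from x."""
    return sum(P[x][y] * math.log(P[x][y] / Q[x][y]) for y in STATES)


def psi_brute_force(t, x):
    """psi(t, x) obtained by summing over every path of length t from x."""
    total = 0.0
    for path in itertools.product(STATES, repeat=t):
        p = q = 1.0
        prev = x
        for y in path:
            p *= P[prev][y]
            q *= Q[prev][y]
            prev = y
        total += p * math.log(p / q)
    return total


def psi_theorem_1(t, x):
    """Right side of (3.2.1): sum over k < t of (P^k psi(1, .))(x)."""
    g = [psi_one_step(y) for y in STATES]      # P^0 applied to psi(1, .)
    total = 0.0
    for _ in range(t):
        total += g[x]
        g = [sum(P[z][y] * g[y] for y in STATES) for z in STATES]
    return total
```

For instance, psi_brute_force(4, 0) and psi_theorem_1(4, 0) agree to
rounding error, although the first sums over 81 paths and the second needs
only four matrix-vector products.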
3.3 Now let us consider the continuous time case. Denote by
$\mathcal{G}_n$ the algebra $\bigvee_{k=1}^{n} \mathcal{B}_{kt/n}$. The
process restricted to this algebra can be considered a discrete-time
process, with transition operator $P_{t/n}^*$. Hence we have, from (3.2.4),

    $I(\mathcal{G}_n \mid x) = \sum_{k=0}^{n-1} P_{kt/n}^* \, \phi(t/n, x)$    (3.3.1)

We have that $\bigvee_{n=1}^{\infty} \mathcal{G}_n = \mathcal{C}_{0,t}$ by
our separability assumption, and hence

    $\lim_{n \to \infty} I(\mathcal{G}_n \mid x) = \psi(t,x)$    (3.3.2)

Let us denote the operator $\frac{1}{n} \sum_{k=0}^{n-1} P_{kt/n}^*$ by
$S_{n,t}$. Then (3.3.2) becomes

    $\psi(t,x) = \lim_{n \to \infty} n \, S_{n,t} \, \phi(t/n, x)$    (3.3.3)




3.4 Before we proceed, let us define a concept of infinitesimal
information, which we will denote by L, by

    $L(x) = \lim_{t \downarrow 0} \frac{\phi(t,x)}{t}$    (3.4.1)

whenever the limit exists.

Now, the operator $S_{n,t}$ converges (since the semi-group is continuous)
to

    $\frac{1}{t} \int_0^t P_s^* \, ds$

Hence, as all the operators are continuous, we find that

    $\psi(t,x) = \lim_{n \to \infty} I(\mathcal{G}_n \mid x)
               = \lim_{n \to \infty} t \, S_{n,t} \, \frac{n}{t} \, \phi(t/n, x)
               = \int_0^t P_s^* L(x) \, ds$    (3.4.2)

provided that L is defined as a uniform limit.

We enunciate the result as

Theorem 2: If L, the infinitesimal information, is defined as a uniform
limit, then

    $\psi(t,x) = \int_0^t P_s^* L(x) \, ds$    (3.4.3)


3.5 Suppose that we have that the function $\phi(t,\cdot)$ is constant for
every t. This will be true if the following condition is satisfied:

For every $t \in T$ and every $x, x' \in E$, there exists a (measurable)
permutation $\pi$ of E such that

    $P_t(x, \pi F) = P_t(x', F)$
    $Q_t(x, \pi F) = Q_t(x', F)$    (3.5.1)

The aforementioned consequence of this condition is easily verified from
the fact that a non-singular transformation leaves the information
invariant [29, Chapter 2, Cor. 4.1].

Condition (3.5.1) is satisfied in particular if the state space has a group
structure, and the process has independent increments. In this case (3.4.3)
takes on a very simple form:

Corollary: If (3.5.1) is satisfied then

    $\psi(t,x) = Lt$    (3.5.2)
3.6 We now have a theorem which ensures the existence of the
infinitesimal information.

Theorem 3: If $\lim_{t \downarrow 0} P_t(x,\{x\}) = \lim_{t \downarrow 0} Q_t(x,\{x\}) = 1$,
and if both $\frac{P_t(x,F)}{t}$ and $\frac{Q_t(x,F)}{t}$ converge
uniformly in $F \in \{x\}^c \mathcal{E}$ for every $x \in E$, then the
infinitesimal information $L(x) = \lim_{t \downarrow 0} \frac{\phi(t,x)}{t}$
exists and is equal to

    $A(x,\{x\}) - B(x,\{x\}) + I[A_1(x,\cdot), B_1(x,\cdot)]$

Proof: We have that

    $\phi(t,x) = \sup_{\mathcal{A} \ \mathrm{finite}} \sum_F P_t(x,F) \log \frac{P_t(x,F)}{Q_t(x,F)}$

the summation extending over all atoms F of $\mathcal{A}$.

Because of the separability of E, we have in fact a sequence
$\mathcal{A}_n \uparrow \mathcal{E}$, such that

    $\phi(t,x) = \lim_{n \to \infty} I(P_t(x,\cdot), Q_t(x,\cdot); \mathcal{A}_n)$

Without loss of generality, we may assume that $\{x\} \in \mathcal{A}_n$
for every n. Then we have

    $\phi(t,x) = \lim_{n \to \infty} I(P_t(x,\cdot), Q_t(x,\cdot); \{x\}^c \mathcal{A}_n) + I(P_t(x,\cdot), Q_t(x,\cdot); \{x\})$

    $= \phi_1(t,x) + \phi_2(t,x)$    (3.6.1)

Now

    $\lim_{t \downarrow 0} \frac{1}{t} I(P_t(x,\cdot), Q_t(x,\cdot); \{x\}^c \mathcal{A}_n)
       = \lim_{t \downarrow 0} \sum_F \frac{P_t(x,F)}{t} \log \frac{P_t(x,F)/t}{Q_t(x,F)/t}$

the summation extending over all atoms F of $\{x\}^c \mathcal{A}_n$,

    $= I(A_1(x,\cdot), B_1(x,\cdot); \mathcal{A}_n)$    (3.6.2)

and because the convergence is uniform in the elements of
$\{x\}^c \mathcal{E}$ we may interchange limits, so that

    $\lim_{t \downarrow 0} \frac{1}{t} \phi_1(t,x) = \lim_{n \to \infty} I(A_1(x,\cdot), B_1(x,\cdot); \mathcal{A}_n)
       = I[A_1(x,\cdot), B_1(x,\cdot)]$    (3.6.3)

It follows readily from l'Hospital's rule that

    $\lim_{t \downarrow 0} \frac{1}{t} \phi_2(t,x) = A(x,\{x\}) - B(x,\{x\})$    (3.6.4)

Hence our theorem is proved.
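
For a finite state space the expression in Theorem 3 is a finite sum, and
the limit defining L can be checked against it. The following is a sketch
under that assumption only: the two generator matrices, the function names
and the series approximation of the transition matrices are illustrative
and not taken from the text.

```python
import math

# Hypothetical generators of two step processes on {0, 1, 2}; all
# off-diagonal rates are positive so that every logarithm is finite.
A = [[-3.0, 2.0, 1.0],
     [0.5, -1.5, 1.0],
     [1.0, 3.0, -4.0]]
B = [[-2.0, 0.5, 1.5],
     [2.0, -2.5, 0.5],
     [1.0, 1.0, -2.0]]
N = len(A)


def infinitesimal_information(x):
    """L(x) = A(x,{x}) - B(x,{x}) + I[A_1(x,.), B_1(x,.)], as in Theorem 3."""
    jump_part = sum(A[x][y] * math.log(A[x][y] / B[x][y])
                    for y in range(N) if y != x)
    return A[x][x] - B[x][x] + jump_part


def transition_matrix(G, t, terms=25):
    """exp(tG), summed as a truncated power series (bounded G only)."""
    result = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[sum(term[i][m] * t * G[m][j] for m in range(N)) / k
                 for j in range(N)] for i in range(N)]
        result = [[r + v for r, v in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result


def phi(t, x):
    """phi(t, x): information between the lag-t transition laws from x."""
    P_t = transition_matrix(A, t)
    Q_t = transition_matrix(B, t)
    return sum(P_t[x][y] * math.log(P_t[x][y] / Q_t[x][y]) for y in range(N))


t = 1e-5
quotients = [phi(t, x) / t for x in range(N)]
limits = [infinitesimal_information(x) for x in range(N)]
```

With these rates the entries of quotients and limits differ only by terms
of order t, which is what the theorem asserts.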


3.7 Equation (3.4.3), though interesting, is not very useful, because its
right side could be rather cumbersome to evaluate. We therefore present
another theorem.

Theorem 4: If (3.4.1) holds uniformly in x and if $L \in C$, the domain of
$A^*$, then

    $\frac{\partial \psi}{\partial t} = A^* \psi + L$    (3.7.1)

Proof: From (3.4.3) and (1.4.8) we have

    $A^* \psi(t,\cdot) = (P_t^* - I) L$    (3.7.2)

From (3.4.3) we also see that

    $\frac{\partial \psi}{\partial t} = P_t^* L$    (3.7.3)

Combining (3.7.2) and (3.7.3) we obtain (3.7.1).

Corollary: If $A_0$ and $B_0$ are bounded and (1.5.3) holds uniformly in x,
then (3.7.1) holds, where L is given by Theorem 3.
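
As a check on Theorems 3 and 4, the Poisson example of 2.1 can be worked
through by hand. The sketch below assumes the intensities of 2.1, namely
$\lambda_1$ for the process under study and $\lambda_2$ for the process
against which we discriminate, and that the only possible jump is from x
to x+1; it only restates results already obtained above.

```latex
\begin{align*}
A(x,\{x\}) &= -\lambda_1, & A_1(x,\{x+1\}) &= \lambda_1, \\
B(x,\{x\}) &= -\lambda_2, & B_1(x,\{x+1\}) &= \lambda_2, \\
L(x) &= A(x,\{x\}) - B(x,\{x\}) + \lambda_1 \log\frac{\lambda_1}{\lambda_2}
      = (\lambda_2 - \lambda_1) + \lambda_1 \log\frac{\lambda_1}{\lambda_2}.
\end{align*}
Since $L$ does not depend on $x$, $P_s^* L = L$ and (3.4.3) gives
$\psi(t,x) = Lt$, in agreement with (2.1.5) and (3.5.2). Moreover
$A^*\psi(t,\cdot) = 0$ because $\psi(t,\cdot)$ is constant, so that
\[
\frac{\partial \psi}{\partial t} = L = A^*\psi + L ,
\]
which is (3.7.1).
```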




REFERENCES 


[1] Aczel, J. "On different characterizations of entropies", Probability
and Information Theory, M. Behara et al. (ed.), Springer-Verlag,
Berlin, 1969, pp. 1-11.

[2] Birkhoff, Garrett. Lattice Theory (2nd edition). American Mathematical
Society, New York, 1948.

[3] Black, Max. "Probability", The Margins of Precision, Cornell University
Press, Ithaca, 1970; reprinted from The Encyclopedia of Philosophy,
Paul Edwards (ed.), Vol. 4, Macmillan, New York, 1967, pp. 169-181.

[4] Caratheodory, C. "Entwurf für eine Algebraisierung des
Integralbegriffs", S.B. bayer. Akad. Wiss. (1938), pp. 27-69.

[5] Cramér, Harald. Mathematical Methods of Statistics, Princeton
University Press, Princeton, 1946.

[6] Csiszar, Imre. "Eine informationstheoretische Ungleichung und ihre
Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten",
A Magyar Tudományos Akadémia Matematikai Kutató Intézetének
Közleményei, 8 (1963), pp. 85-108.

[7] Csiszar, Imre. "Information type measures of difference of probability
distributions and indirect observations", Studia Scientiarum
Mathematicarum Hungarica, 2 (1967), pp. 299-318.

[8] Csiszar, Imre. "On topological properties of f-divergences", ibid.,
pp. 329-339.

[9] Doob, J.L. Stochastic Processes. Wiley, New York, 1953.



[10] Dynkin, E.B. Markov Processes, Academic Press, New York;
Springer-Verlag, Berlin, 1965.

[11] Feinstein, Amiel. Foundations of Information Theory, McGraw-Hill,
New York, 1958.

[12] Feller, William. An Introduction to Probability Theory and its
Applications, Vol. 2, 2nd edition, Wiley, New York, 1971.

[13] Fisher, R.A. "On the mathematical foundations of theoretical
statistics", Contributions to Mathematical Statistics, Wiley, New York,
1950; reprinted from Philosophical Transactions of the Royal Society of
London, 222A (1922), pp. 309-368.

[14] Fisher, R.A. "Theory of statistical estimation", op. cit., reprinted
from Proceedings of the Cambridge Philosophical Society, 22, pt. 5
(1925), pp. 700-725.

[15] Fisher, R.A. "The logic of inductive inference", op. cit., reprinted
from Journal of the Royal Statistical Society, 98, pt. 1 (1935),
pp. 39-54.

[16] Fréchet, Maurice. "Sur la notion de voisinage dans les ensembles
abstraits", Bulletin des Sciences Mathématiques, 42 (1918), pp. 138-156.

[17] Fréchet, Maurice. Les espaces abstraits. Gauthier-Villars, Paris,
1928.

[18] Ghurye, S.G. "Information and sufficient subfields", Annals of
Mathematical Statistics, 39 (1968), pp. 2056-2066.

[19] Halmos, Paul R. "The foundations of probability", American
Mathematical Monthly, 51 (1944), pp. 497-510.



[20] Halmos, Paul R. Measure Theory. Van Nostrand, Princeton, 1950.

[21] Hille, Einar. Functional Analysis and Semi-Groups, American
Mathematical Society, 1948.

[22] Ingarden, R.S., and Urbanik, K. "Information without probability",
Colloquium Mathematicum, 9 (1962).

[23] Ingarden, R.S. "Simplified axioms for information without
probability", Prace Matematyczne, 9 (1965), pp. 273-282.

[24] Kampé de Fériet, Joseph et Forte, Bruno. "Information et
probabilité", Comptes Rendus de l'Académie des Sciences de Paris,
265 A (1967).

[25] Kappos, Demetrios A. Strukturtheorie der Wahrscheinlichkeitsfelder
und -räume, Springer-Verlag, Berlin (1960).

[26] Kappos, Demetrios A. Probability Algebras and Stochastic Spaces,
Academic Press, New York (1969).

[27] Kolmogorov, A.N., Gel'fand, I.M., and Yaglom, A.M. "On the general
definition of the amount of information", Doklady Akademii Nauk SSSR,
111 (1956), pp. 745-748.

[28] Kullback, S., and Leibler, R.A. "Information and sufficiency", Annals
of Mathematical Statistics, 22 (1951), pp. 79-86.

[29] Kullback, S. Information Theory and Statistics. Dover, New York,
1968.

[30] Lee, P.M. "On the axioms of information theory", Annals of
Mathematical Statistics, 35 (1964), pp. 414-441.



[31] Lindley, D.V. "On a measure of the information provided by an
experiment", ibid. 27 (1956), pp. 986-1005.

[32] Meyer, Paul-André. Probability and Potentials, Blaisdell, Waltham,
1966; translation of Probabilités et potentiel, Hermann, Paris, 1966.

[33] Meyer, Paul-André. Processus de Markov, Springer-Verlag, Berlin
(1967).

[34] Parry, William. Entropy and Generators in Ergodic Theory, Benjamin,
New York (1969).

[35] Rao, C. Radhakrishna. "Information and accuracy obtainable in an
estimation of a statistical parameter", Bulletin of the Calcutta
Mathematical Society, 37 (1945), pp. 81-

[36] Renyi, Alfred. "On measures of entropy and information", Fourth
Berkeley Symposium on Mathematical Statistics and Probability, 1,
University of California Press, Berkeley, 1961, pp. 547-561.

[37] Royden, H.L. Real Analysis (2nd edition), Macmillan, New York, 1968.

[38] Sakaguchi, Minoru. Information and Decision Making, George Washington
University, Washington, 1964.

[39] Shannon, Claude E. The Mathematical Theory of Communication (with an
introductory article by Warren Weaver), University of Illinois Press,
Urbana, 1949; reprinted from Bell System Technical Journal.

[40] Wiener, Norbert. Cybernetics (2nd edition). MIT Press, Cambridge,
Mass., 1961.




APPENDIX 1 


A convergence theorem:

Let $(\Omega, \mathcal{F}, P)$ be a probability space.
Let $\{\mathcal{A}_n\}$ be a family of $\sigma$-algebras increasing to
$\mathcal{A} \subset \mathcal{F}$.
Let f be a function convex on $[0,\infty)$, and let p be a non-negative
integrable function.
Let $\phi_n = E(p \mid \mathcal{A}_n)$ and $\phi = E(p \mid \mathcal{A})$.
Further, let $i_n = f \circ \phi_n$ and $i = f \circ \phi$.
Then $i_n \to i$ a.e. and $E(i_n) \to E(i)$.

Proof: $\{\phi_n ; n = 1, \dots, \infty\}$ is a martingale, and hence
$\phi_n \to \phi$ a.e. [28]. As convexity implies continuity,
$i_n \to i$ a.e. Furthermore $\{i_n ; n = 1, \dots, \infty\}$ forms a
sub-martingale. Hence $n < m$ implies

    $E(i_n) \le E(i_m) \le E(i)$

Thus $E(i_n)$ is increasing and bounded above, and converges to some
number $\le E(i)$. Now $E(i) = E(i^+) - E(i^-)$.

By Fatou's lemma,

    $\lim_n E(i_n^+) \ge E(i^+)$

so it only remains to prove that

    $\lim_n E(i_n^-) = E(i^-)$




Let $x_0$ be the largest number such that $f(x_0) = 0$. If there is no such
number, then $f \ge 0$, $i_n^- = 0$ and the theorem is proved. Otherwise,
let $\alpha x + \beta$ be the line of support of f at $x_0$, i.e.,

    $\alpha x_0 + \beta = 0$ and $f(x) \ge \alpha x + \beta$

Case 1: Suppose $\alpha > 0$. Let

    $g(x) = -\alpha x - \beta$ for $x < x_0$
    $g(x) = 0$ for $x \ge x_0$

Then g is convex, bounded and integrable, and

    $i_n^- \le g \circ \phi_n \le -\beta$

hence by bounded convergence $E(i_n^-) \to E(i^-)$.

Case 2: Suppose $\alpha \le 0$. Let

    $g(x) = -\alpha x - \beta$ for $x \ge x_0$
    $g(x) = 0$ for $x < x_0$

Then $f^- \le g$ and

    $i_n^- = f^- \circ \phi_n \le g \circ \phi_n$

Hence, by dominated convergence, $E(i_n^-) \to E(i^-)$.
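
The theorem can be illustrated on a case where every quantity is known in
closed form. The sketch below is an assumption-laden example and not part
of the proof: it takes the unit interval with Lebesgue measure, the dyadic
partitions as the increasing family, the density p(w) = 2w and the convex
function f(x) = x log x, all chosen for illustration.

```python
import math


def f(x):
    """Convex function x log x on [0, inf), with f(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def expected_f_of_conditional_mean(n):
    """E f(E(p | A_n)) for p(w) = 2w on [0, 1), A_n the dyadic cells of level n."""
    cells = 2 ** n
    total = 0.0
    for k in range(cells):
        # Average of 2w over the cell [k / 2^n, (k + 1) / 2^n).
        mean_of_p = (2 * k + 1) / cells
        total += f(mean_of_p) / cells
    return total


# E f(p) = integral of 2w log(2w) over [0, 1) = log 2 - 1/2.
limit = math.log(2.0) - 0.5
sequence = [expected_f_of_conditional_mean(n) for n in range(13)]
```

The entries of sequence increase with n and approach limit from below, as
the sub-martingale argument in the proof predicts.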



APPENDIX 2 


A semi-group theorem: 


Let $(E, \mathcal{E})$ be a measurable space.
Let $\{P_t\}$ be a semi-group of Markovian kernels on $(E, \mathcal{E})$ and let $P_t^*$
be the associated linear operators.
Suppose that

$$A_1(x, \{x\}) = \lim_{t \downarrow 0} \frac{P_t(x, \{x\}) - 1}{t}$$

and

$$A_2(x, F) = \lim_{t \downarrow 0} \frac{P_t(x, F)}{t}$$

exist uniformly in $x$ and $F \in \{x\}^c \cap \mathcal{E}$. Then the limit of $\dfrac{P_t^* - I}{t}$
exists uniformly as $t \downarrow 0$ and $A = A_1 + A_2$ is the kernel of this infinitesimal operator.


Proof: 


$$\left\| \frac{P_t^* - I}{t} - A \right\| = \sup_{\|f\| = 1} \, \sup_x \left| \frac{\int P_t(x, dy) f(y) - f(x)}{t} - \int A(x, dy) f(y) \right|.$$

Let us separate the integral into two parts: over $\{x\}$ and over $E - \{x\}$.
Of course

$$\begin{aligned}
\left| \frac{\int_E P_t(x, dy) f(y) - f(x)}{t} - \int_E A(x, dy) f(y) \right|
&\le \left| \frac{\int_{\{x\}} P_t(x, dy) f(y) - f(x)}{t} - \int_{\{x\}} A(x, dy) f(y) \right| \\
&\quad + \left| \int_{E - \{x\}} \frac{P_t(x, dy) f(y)}{t} - \int_{E - \{x\}} A(x, dy) f(y) \right|.
\end{aligned}$$

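Now, the first quantity in $|\ |$ is merely

$$\left| \frac{P_t(x, \{x\}) - 1}{t} - A(x, \{x\}) \right| \, |f(x)|$$

and as we are taking the sup over functions for which $\|f\| = 1$,
$|f(x)| \le 1$. Thus we can make the first quantity $< \varepsilon$. The second
quantity in $|\ |$ is less than or equal to

$$\int_{E - \{x\}} \left| \frac{P_t(x, dy)}{t} - A(x, dy) \right| \, |f(y)|$$

and again, as $|f(y)| \le 1$, the integral is at most the total variation on $E - \{x\}$ of the signed measure $P_t(x, \cdot)/t - A(x, \cdot)$, so that it is

$$\le 2 \sup_{F \in \{x\}^c \cap \mathcal{E}} \left| \frac{P_t(x, F)}{t} - A(x, F) \right|,$$

where the factor 2 comes from splitting $E - \{x\}$ into the sets on which this signed measure is positive and negative (Hahn decomposition). This bound tends to 0 as $t \downarrow 0$
because the convergence is uniform in both $x$ and $F$.
Thus the theorem is proved.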
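
The convergence asserted by the theorem can be observed numerically in the simplest case. The following is a minimal sketch under assumptions that are not in the text: $E$ is a three-point space, the semi-group is $P_t = e^{tQ}$ for a hypothetical generator matrix `Q` (rows sum to zero, off-diagonal entries non-negative), and the matrix exponential is computed by a truncated Taylor series. For a finite state space the operator norm over functions with $\|f\| = 1$ is the maximum absolute row sum, which is what `generator_error` computes.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_exp(q, t, terms=30):
    """exp(t*q) by a truncated Taylor series (adequate for small t*|q|)."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (t*q)**k / k!
        term = [[v * t / k for v in row] for row in mat_mul(term, q)]
        result = [
            [result[i][j] + term[i][j] for j in range(n)] for i in range(n)
        ]
    return result


def generator_error(q, t):
    """Sup-norm operator norm of (P_t - I)/t - q, i.e. max absolute row sum."""
    n = len(q)
    p = mat_exp(q, t)
    return max(
        sum(
            abs((p[i][j] - (1.0 if i == j else 0.0)) / t - q[i][j])
            for j in range(n)
        )
        for i in range(n)
    )


# Hypothetical generator on a three-point state space.
Q = [
    [-2.0, 1.0, 1.0],
    [0.5, -1.0, 0.5],
    [1.0, 3.0, -4.0],
]

errors = [generator_error(Q, t) for t in (0.1, 0.01, 0.001)]
for t, err in zip((0.1, 0.01, 0.001), errors):
    print(t, err)
```

In this finite setting uniformity in $x$ and $F$ is automatic, so the sketch only illustrates the statement; the diagonal of `Q` plays the role of $A_1$ and the off-diagonal part the role of $A_2$.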




